In the previous page, we constructed the likelihood function for logistic regression:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
This expression is mathematically elegant but computationally treacherous. When you multiply hundreds or thousands of probabilities—each between 0 and 1—the result rapidly approaches zero. A typical dataset might yield a likelihood of $10^{-500}$ or smaller, far below the smallest positive number representable in double precision ($\approx 10^{-308}$).
But the solution is beautifully simple: take logarithms. The logarithm transforms products into sums, preserves the location of maxima, and converts tiny positive numbers into manageable negative numbers. This single transformation makes maximum likelihood estimation computationally feasible and mathematically elegant.
More profoundly, the log-likelihood reveals structural properties hidden in the likelihood—most importantly, concavity. This concavity guarantees that optimization algorithms will find the global maximum, not get stuck in local optima.
By the end of this page, you will understand why the log-likelihood is essential for computation, derive the log-likelihood function for logistic regression step by step, prove its concavity rigorously, and appreciate how this concavity enables efficient optimization. These insights extend to other generalized linear models, and the cross-entropy objective itself reappears in neural network training (though without the concavity guarantee).
The logarithm is one of the most powerful tools in applied mathematics. For maximum likelihood estimation, it provides multiple critical benefits:
1. Numerical Stability
Multiplying many probabilities causes numerical underflow: for example, $0.5^{1000} \approx 10^{-301}$ sits at the very edge of double precision, and any smaller product rounds to exactly zero.
The log-likelihood stays manageable: the same quantity on the log scale is $1000 \log 0.5 \approx -693$, a perfectly ordinary number (see the short demo after this list).
2. Mathematical Simplification
Products become sums: $\log \prod_i f_i = \sum_i \log f_i$
Sums are much easier to:
- differentiate (the derivative of a sum is the sum of the derivatives of its terms),
- compute and parallelize across observations,
- analyze statistically (sums of independent terms obey laws of large numbers and central limit theorems).
3. Preservation of Optima
Since $\log$ is a strictly increasing function, it preserves the location of maxima and minima:
$$\arg\max_\theta \mathcal{L}(\theta) = \arg\max_\theta \log \mathcal{L}(\theta)$$
Maximizing the log-likelihood gives the same parameter estimates as maximizing the likelihood itself.
Any strictly monotonic transformation preserves the location of extrema. We could use $\sqrt{\mathcal{L}}$ or $\mathcal{L}^{0.1}$ instead of $\log \mathcal{L}$—they all yield the same MLE. We choose logarithm because it also transforms products to sums, making derivatives analytically tractable and connecting to information-theoretic quantities.
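The contrast is easy to see numerically. Below is a minimal sketch (the probabilities are arbitrary simulated values, not from any particular model): the raw product underflows to zero, while the sum of logs is an ordinary negative number, and the strictly increasing log preserves every ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation probabilities of the observed outcomes
probs = rng.uniform(0.1, 0.9, size=1000)

# Naive likelihood: the product underflows to exactly 0.0 in double precision
likelihood = np.prod(probs)

# Log-likelihood: the sum of logs is a large but representable negative number
log_likelihood = np.sum(np.log(probs))

print(f"Product of probabilities: {likelihood}")
print(f"Sum of log-probabilities: {log_likelihood:.1f}")

# Because log is strictly increasing, it preserves orderings (and hence the argmax)
a, b = 1e-300, 2e-300
print((a < b) == (np.log(a) < np.log(b)))  # True
```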
Let us carefully derive the log-likelihood function for logistic regression, step by step.
Starting Point — The Likelihood:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
where $\pi_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}_i}}$
Step 1: Apply Logarithm
$$\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \log \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
Step 2: Convert Product to Sum
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \left[ \pi_i^{y_i} (1 - \pi_i)^{1-y_i} \right]$$
Step 3: Apply Logarithm Rules
Using $\log(ab) = \log a + \log b$ and $\log a^b = b \log a$:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$$
Up to a sign, this is exactly the binary cross-entropy between the true labels $y_i$ and the predicted probabilities $\pi_i$, summed over all observations. Cross-entropy loss in machine learning is simply the negative log-likelihood!
Step 4: Substitute the Sigmoid
Now we substitute $\pi_i = \sigma(z_i)$ where $z_i = \boldsymbol{\theta}^T \mathbf{x}_i$:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right]$$
Step 5: Use Sigmoid Properties
Recall the key sigmoid properties:
$$\log \sigma(z) = -\log(1 + e^{-z}), \qquad \log(1 - \sigma(z)) = \log\frac{e^{-z}}{1 + e^{-z}} = -z - \log(1 + e^{-z})$$
Substituting:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ -y_i \log(1 + e^{-z_i}) + (1-y_i)\left(-z_i - \log(1 + e^{-z_i})\right) \right]$$
Step 6: Simplify
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ -y_i \log(1 + e^{-z_i}) - z_i + y_i z_i - \log(1 + e^{-z_i}) + y_i \log(1 + e^{-z_i}) \right]$$
Collecting terms:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i z_i - z_i - \log(1 + e^{-z_i}) \right]$$
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i z_i - \log(1 + e^{z_i}) \right]$$
where in the last step we used $z_i + \log(1 + e^{-z_i}) = \log(e^{z_i}) + \log(1 + e^{-z_i}) = \log(e^{z_i} + 1) = \log(1 + e^{z_i})$.
The final expression $\ell(\boldsymbol{\theta}) = \sum_i [y_i z_i - \log(1 + e^{z_i})]$ is elegant: a linear term $y_i z_i$ minus the log-partition function $\log(1 + e^{z_i})$. This is the canonical form for exponential family log-likelihoods, $\sum_i [y_i \eta_i - b(\eta_i)]$ (up to terms that do not involve $\boldsymbol{\theta}$), with natural parameter $\eta_i = z_i$ and log-partition (cumulant) function $b(\eta) = \log(1 + e^{\eta})$.
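As a quick check of the algebra, take a single observation with $y_i = 1$ and $z_i = 2$ (values chosen purely for illustration). The standard form and the simplified form give the same contribution:
$$y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i) = \log \sigma(2) = -\log(1 + e^{-2}) \approx -0.127$$
$$y_i z_i - \log(1 + e^{z_i}) = 2 - \log(1 + e^{2}) \approx 2 - 2.127 = -0.127$$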
Different formulations of the log-likelihood are useful in different contexts. Let's catalog the main variants.
Formulation 1: Standard Form
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
Intuition: Each observation contributes the log-probability of its actual outcome.
Formulation 2: Exponential Family Form
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i}) \right]$$
Intuition: Linear in the natural parameter minus the log-normalizer.
Formulation 3: Compact Vector Form
$$\ell(\boldsymbol{\theta}) = \mathbf{y}^T \mathbf{X} \boldsymbol{\theta} - \mathbf{1}^T \log(\mathbf{1} + \exp(\mathbf{X}\boldsymbol{\theta}))$$
Intuition: Matrix operations for efficient computation.
Formulation 4: Cross-Entropy Loss Form (Negative Log-Likelihood)
$$\mathcal{J}(\boldsymbol{\theta}) = -\frac{1}{n} \ell(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \left[ -y_i \log \pi_i - (1-y_i) \log(1-\pi_i) \right]$$
Intuition: The "loss" we minimize; average cross-entropy per observation.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def log_likelihood_standard(theta, X, y):
    """
    Standard form: Σ [y_i log(π_i) + (1-y_i) log(1-π_i)]

    This form is intuitive but requires careful handling to avoid log(0).
    """
    z = X @ theta
    pi = sigmoid(z)
    # Clip probabilities to avoid log(0)
    eps = 1e-15
    pi = np.clip(pi, eps, 1 - eps)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def log_likelihood_exponential(theta, X, y):
    """
    Exponential family form: Σ [y_i * z_i - log(1 + exp(z_i))]

    This form is numerically more stable but still needs care with large z values.
    """
    z = X @ theta
    # Use a cutoff for numerical stability when z is large:
    # log(1 + exp(z)) ≈ z when z >> 0
    log_partition = np.where(
        z > 20,
        z,  # For large z: log(1 + exp(z)) ≈ z
        np.log(1 + np.exp(z))
    )
    return np.sum(y * z - log_partition)

def log_likelihood_stable(theta, X, y):
    """
    Most numerically stable form using the log-sum-exp trick.

    This is the preferred implementation for production code.
    """
    z = X @ theta
    # Stable computation of log(1 + exp(z)) = max(0, z) + log(1 + exp(-|z|))
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

# Verify all formulations give the same result
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Test with some arbitrary theta
theta_test = np.array([0.3, 0.8, -0.3])

ll_standard = log_likelihood_standard(theta_test, X, y)
ll_exp = log_likelihood_exponential(theta_test, X, y)
ll_stable = log_likelihood_stable(theta_test, X, y)

print("Log-likelihood values (should be equal):")
print(f"  Standard form:    {ll_standard:.6f}")
print(f"  Exponential form: {ll_exp:.6f}")
print(f"  Stable form:      {ll_stable:.6f}")
```

In production code, always use numerically stable formulations. The naive computation of $\log(1 + e^z)$ overflows for large $z$, and $\log(\pi_i)$ blows up to $-\infty$ when $\pi_i$ is rounded to 0. The stable implementations above handle these edge cases correctly.
The most important structural property of the logistic regression log-likelihood is concavity. A function is concave if its Hessian (matrix of second derivatives) is negative semi-definite everywhere.
Why Concavity Matters:
- Every local maximum of a concave function is a global maximum, so any stationary point of the log-likelihood is the MLE.
- Gradient ascent and Newton-type methods converge reliably, regardless of where they start.
- There are no spurious local optima or saddle points that could trap the optimizer.
Proving Concavity:
We prove concavity by showing the Hessian is negative semi-definite. Consider the log-likelihood of a single observation (the sum inherits concavity from its summands):
$$\ell_i(\boldsymbol{\theta}) = y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i})$$
Step 1: First Derivative (Gradient)
Let $z_i = \boldsymbol{\theta}^T \mathbf{x}_i$ and $\pi_i = \sigma(z_i)$.
$$\frac{\partial \ell_i}{\partial \boldsymbol{\theta}} = y_i \mathbf{x}_i - \frac{e^{z_i}}{1 + e^{z_i}} \mathbf{x}_i = y_i \mathbf{x}_i - \pi_i \mathbf{x}_i = (y_i - \pi_i) \mathbf{x}_i$$
Step 2: Second Derivative (Hessian)
$$\frac{\partial^2 \ell_i}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\frac{\partial \pi_i}{\partial \boldsymbol{\theta}} \mathbf{x}_i^T$$
Using $\frac{\partial \pi_i}{\partial z_i} = \pi_i(1 - \pi_i)$ and $\frac{\partial z_i}{\partial \boldsymbol{\theta}} = \mathbf{x}_i$:
$$\frac{\partial^2 \ell_i}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\pi_i(1 - \pi_i) \mathbf{x}_i \mathbf{x}_i^T$$
Step 3: Sum Over Observations
The total Hessian is:
$$\mathbf{H}(\boldsymbol{\theta}) = \frac{\partial^2 \ell}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\sum_{i=1}^{n} \pi_i(1-\pi_i) \mathbf{x}_i \mathbf{x}_i^T$$
In matrix notation:
$$\mathbf{H}(\boldsymbol{\theta}) = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$
where $\mathbf{W} = \text{diag}(\pi_1(1-\pi_1), \ldots, \pi_n(1-\pi_n))$ is a diagonal weight matrix.
Step 4: Negative Semi-Definiteness
For any vector $\mathbf{v} \neq \mathbf{0}$:
$$\mathbf{v}^T \mathbf{H} \mathbf{v} = -\mathbf{v}^T \mathbf{X}^T \mathbf{W} \mathbf{X} \mathbf{v} = -(\mathbf{X}\mathbf{v})^T \mathbf{W} (\mathbf{X}\mathbf{v})$$
Since $\pi_i(1-\pi_i) > 0$ for all $\pi_i \in (0, 1)$, the matrix $\mathbf{W}$ is positive definite. Therefore:
$$(\mathbf{X}\mathbf{v})^T \mathbf{W} (\mathbf{X}\mathbf{v}) \geq 0$$
with equality only if $\mathbf{X}\mathbf{v} = \mathbf{0}$.
Conclusion: $\mathbf{v}^T \mathbf{H} \mathbf{v} \leq 0$ for all $\mathbf{v}$, so $\mathbf{H}$ is negative semi-definite. The log-likelihood is concave. ∎
The log-likelihood is strictly concave (guaranteeing a unique maximum) if and only if $\mathbf{X}$ has full column rank and no $\pi_i \in \{0, 1\}$. If observations are perfectly separable (all $\pi_i \to 0$ or $1$), the MLE doesn't exist—parameters diverge to $\pm\infty$. We'll address this when discussing regularization.
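The sketch below illustrates this failure mode on a hypothetical perfectly separable dataset (the data, step size, and iteration count are arbitrary choices): gradient ascent keeps inflating $\|\boldsymbol{\theta}\|$ while the log-likelihood creeps toward 0 from below, so no finite maximizer exists.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Perfectly separable toy data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)

theta = np.zeros(2)
lr = 0.5

for step in range(1, 5001):
    pi = sigmoid(X @ theta)
    theta += lr * X.T @ (y - pi)   # gradient ascent on the log-likelihood
    if step % 1000 == 0:
        z = X @ theta
        ll = np.sum(y * z - (np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))))
        print(f"step {step:5d}: ||theta|| = {np.linalg.norm(theta):8.2f}, "
              f"log-likelihood = {ll:.6f}")

# ||theta|| keeps growing while the log-likelihood approaches 0 from below:
# for separable data the MLE does not exist.
```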
The concavity of the log-likelihood has beautiful geometric implications that aid intuition.
Level Sets and Contours:
The superlevel sets of the log-likelihood (the regions where $\ell(\boldsymbol{\theta})$ exceeds a given value) are convex, as they are for any concave function. As we move outward from the maximum, the log-likelihood decreases smoothly in every direction; there are no saddle points or ridges that could trap an optimization algorithm.
The Parameter Space Landscape:
Imagine standing at a point in parameter space $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)$:
- The surface around you rises toward a single peak, the MLE $\hat{\boldsymbol{\theta}}$.
- The gradient (the score) always points uphill toward that peak, so following it can never lead you astray.
- Along any fixed direction, the log-likelihood rises and then falls at most once; it never undulates.
Comparison with Non-Convex Landscapes:
Neural networks have non-convex loss surfaces with:
- many local minima of varying quality,
- saddle points and flat plateaus where gradients nearly vanish,
- sensitivity to initialization and optimizer settings.
Logistic regression's concave log-likelihood has none of these complications—it's a well-behaved optimization surface.
```python
import numpy as np

def log_likelihood_surface(theta0_range, theta1_range, X, y):
    """
    Compute log-likelihood over a grid of parameter values.
    Demonstrates the concave (bowl-shaped) structure.
    """
    T0, T1 = np.meshgrid(theta0_range, theta1_range)
    log_lik = np.zeros_like(T0)

    for i in range(len(theta0_range)):
        for j in range(len(theta1_range)):
            theta = np.array([theta0_range[i], theta1_range[j]])
            z = X @ theta
            # Stable log-likelihood computation
            log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
            log_lik[j, i] = np.sum(y * z - log_partition)

    return T0, T1, log_lik

def verify_hessian_negative_definite(theta, X, y):
    """
    Verify the Hessian is negative semi-definite at theta.
    """
    z = X @ theta
    pi = 1 / (1 + np.exp(-z))

    # Weight matrix W = diag(π_i(1-π_i))
    w = pi * (1 - pi)

    # Hessian H = -X^T W X
    W = np.diag(w)
    H = -X.T @ W @ X

    # Check eigenvalues are all ≤ 0
    eigenvalues = np.linalg.eigvalsh(H)

    print("Hessian Eigenvalue Analysis:")
    print(f"  Eigenvalues: {eigenvalues}")
    print(f"  All ≤ 0? {np.all(eigenvalues <= 1e-10)}")
    print("  → Hessian is negative semi-definite: the log-likelihood is concave!")

    return eigenvalues

# Generate example data
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n)])
theta_true = np.array([-1.0, 2.0])
z = X @ theta_true
y = (np.random.rand(n) < 1 / (1 + np.exp(-z))).astype(float)

# Verify at several random parameter values
for _ in range(3):
    theta_test = np.random.randn(2)
    print(f"\nAt θ = [{theta_test[0]:.2f}, {theta_test[1]:.2f}]:")
    verify_hessian_negative_definite(theta_test, X, y)
```

The negative expected Hessian $\mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}[\mathbf{H}(\boldsymbol{\theta})] = \mathbf{X}^T \mathbf{W} \mathbf{X}$ is the Fisher Information Matrix. It measures how much information the data carries about the parameters. Larger Fisher information means more precise parameter estimates (smaller confidence regions).
The gradient of the log-likelihood has a special name in statistics: the score function. Understanding it is key to both optimization and statistical inference.
The Score Function:
$$\mathbf{s}(\boldsymbol{\theta}) = \nabla_\theta \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} (y_i - \pi_i) \mathbf{x}_i = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$$
where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)^T$ is the vector of predicted probabilities.
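Before interpreting the score, it is worth confirming the formula numerically. The sketch below (on arbitrary simulated data; the data-generating parameters are illustrative) compares the analytic score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ against central finite differences of the log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(theta, X, y):
    z = X @ theta
    return np.sum(y * z - (np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))))

def score(theta, X, y):
    """Analytic score: s(theta) = X^T (y - pi)."""
    return X.T @ (y - sigmoid(X @ theta))

# Toy data (arbitrary, for illustration only)
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

theta = rng.normal(size=p)
s_analytic = score(theta, X, y)

# Central finite differences, one coordinate at a time
eps = 1e-6
s_numeric = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    s_numeric[j] = (log_likelihood(theta + e, X, y) -
                    log_likelihood(theta - e, X, y)) / (2 * eps)

print("analytic:", np.round(s_analytic, 6))
print("numeric :", np.round(s_numeric, 6))
```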
Interpretation:
Each observation contributes $(y_i - \pi_i) \mathbf{x}_i$ to the score:
- If $y_i = 1$ but $\pi_i$ is small, the residual $y_i - \pi_i$ is large and positive, pushing $\boldsymbol{\theta}$ to raise the predicted probability along the direction $\mathbf{x}_i$.
- If $y_i = 0$ but $\pi_i$ is large, the push is equally strong in the opposite direction.
- Well-predicted observations ($\pi_i \approx y_i$) contribute almost nothing.
The Score Equations:
At the maximum likelihood estimate, the score is zero:
$$\mathbf{s}(\hat{\boldsymbol{\theta}}) = \mathbf{X}^T (\mathbf{y} - \hat{\boldsymbol{\pi}}) = \mathbf{0}$$
This is a system of $p+1$ equations (one per parameter) in $p+1$ unknowns.
Moment Interpretation:
The score equations state that, at the MLE, the feature-weighted residuals sum to zero:
$$\sum_{i=1}^{n} (y_i - \hat{\pi}_i) x_{ij} = 0 \quad \text{for each } j = 0, 1, \ldots, p$$
For the intercept ($j = 0$, $x_{i0} = 1$): $$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{\pi}_i$$
The sum of observed 1's equals the sum of predicted probabilities—the model is "calibrated" in aggregate.
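This aggregate calibration is easy to verify numerically. The sketch below fits the MLE with a few Newton steps (using the Hessian derived earlier; the simulated data and iteration count are illustrative assumptions) and checks that the score equations hold at $\hat{\boldsymbol{\theta}}$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data (arbitrary choices, for illustration only)
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (rng.random(n) < sigmoid(X @ np.array([-0.5, 1.2, -0.8]))).astype(float)

# Fit the MLE with Newton steps: theta <- theta + (X^T W X)^{-1} X^T (y - pi)
theta = np.zeros(X.shape[1])
for _ in range(25):
    pi = sigmoid(X @ theta)
    W = np.diag(pi * (1 - pi))
    theta += np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))

pi_hat = sigmoid(X @ theta)

# Score equations at the MLE: X^T (y - pi_hat) ≈ 0
print("X^T (y - pi_hat):", np.round(X.T @ (y - pi_hat), 8))

# Intercept equation: sum of observed 1's equals sum of fitted probabilities
print("sum(y)      =", y.sum())
print("sum(pi_hat) =", round(pi_hat.sum(), 6))
```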
It's illuminating to compare the log-likelihood structure of logistic regression with linear regression. Both fit within the generalized linear model framework.
Linear Regression (Gaussian Response):
Model: $y_i = \boldsymbol{\theta}^T \mathbf{x}_i + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
Log-likelihood: $$\ell(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \boldsymbol{\theta}^T \mathbf{x}_i)^2$$
Score: $\mathbf{s}(\boldsymbol{\theta}) = \frac{1}{\sigma^2} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$
Setting to zero: $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ → $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Logistic Regression (Bernoulli Response):
Model: $y_i \sim \text{Bernoulli}(\sigma(\boldsymbol{\theta}^T \mathbf{x}_i))$
Log-likelihood: $$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i}) \right]$$
Score: $\mathbf{s}(\boldsymbol{\theta}) = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$
Setting to zero: $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ — nonlinear in $\boldsymbol{\theta}$!
| Property | Linear Regression | Logistic Regression |
|---|---|---|
| Response Distribution | Gaussian (continuous) | Bernoulli (binary) |
| Link Function | Identity: $\mu = \boldsymbol{\theta}^T\mathbf{x}$ | Logit: $\log\frac{\pi}{1-\pi} = \boldsymbol{\theta}^T\mathbf{x}$ |
| Log-Likelihood Form | Quadratic in $\boldsymbol{\theta}$ | Nonlinear in $\boldsymbol{\theta}$ |
| Score Equation | Linear: $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ | Nonlinear: $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ |
| Closed-Form Solution | Yes: $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | No |
| Hessian | Constant: $-\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$ | Varies: $-\mathbf{X}^T\mathbf{W}(\boldsymbol{\theta})\mathbf{X}$ |
| Optimization | One-shot matrix inversion | Iterative (gradient methods) |
The fundamental difference is that in linear regression, the mapping from parameters to predictions is linear ($\hat{y} = \mathbf{X}\boldsymbol{\theta}$), while in logistic regression, it's nonlinear ($\hat{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$). This nonlinearity is why we need iterative optimization methods—the score equations cannot be solved analytically.
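To make the contrast concrete, here is a minimal sketch on simulated data (the dataset, learning rate, and iteration count are arbitrary illustrative choices): ordinary least squares is solved in one shot from the normal equations, while the logistic MLE is obtained by iterating on the score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Linear regression: closed-form solution from the normal equations
beta_true = np.array([1.0, 2.0])
y_lin = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_lin)   # one-shot solve
print("OLS estimate:", np.round(beta_hat, 3))

# Logistic regression: no closed form; iterate gradient ascent on the log-likelihood
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta_true = np.array([-1.0, 2.0])
y_bin = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

theta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    theta += lr * X.T @ (y_bin - sigmoid(X @ theta))   # score = X^T (y - pi)
print("Logistic estimate (iterative):", np.round(theta, 3))
```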
We have developed the log-likelihood formulation that is central to training logistic regression models. Let's consolidate the key insights:
- Taking logarithms turns the product of probabilities into a sum, preventing numerical underflow while preserving the location of the maximum.
- The log-likelihood simplifies to $\ell(\boldsymbol{\theta}) = \sum_i [y_i z_i - \log(1 + e^{z_i})]$, which is the negative of the binary cross-entropy loss.
- Its Hessian, $-\mathbf{X}^T\mathbf{W}\mathbf{X}$, is negative semi-definite, so the log-likelihood is concave and every stationary point is a global maximum.
- The score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ vanishes at the MLE, which means the fitted probabilities are calibrated in aggregate.
- Unlike linear regression, the score equations are nonlinear in $\boldsymbol{\theta}$, so estimation requires iterative optimization.
What's Next:
We've established the log-likelihood and proven its concavity, but we haven't yet derived the gradient explicitly in a form suitable for optimization algorithms. The next page develops the gradient derivation in detail—computing the partial derivatives, understanding the backpropagation structure, and setting up the foundation for gradient descent and Newton's method.
You now understand why the log-likelihood is essential for optimization, can derive it from the likelihood, appreciate its concavity (negative semi-definite Hessian), and connect it to cross-entropy loss. Next, we'll derive the gradient explicitly for implementation in optimization algorithms.