Every optimization algorithm, from basic gradient descent to sophisticated second-order methods, relies on one fundamental object: the gradient. The gradient tells us the direction of steepest increase of a function and, by negation, the direction of steepest descent.
In the previous pages, we established the log-likelihood function for logistic regression and proved its concavity. Now we derive the gradient explicitly—the mathematical machinery that drives parameter updates in training.
This derivation is not merely academic. When you implement logistic regression from scratch, when you debug mysterious convergence issues, when you design custom loss functions for neural networks, you will return to these calculations. The gradient derivation for logistic regression is also the prototype for backpropagation in deep learning—understanding it here illuminates the entire field.
We'll derive the gradient multiple ways: element-by-element, in vector form, and using the chain rule systematically. By the end, you'll have both the formulas for implementation and the intuition for what they mean.
By the end of this page, you will be able to derive the gradient of the logistic regression log-likelihood, express it in compact matrix form for efficient computation, understand the intuition behind the gradient structure, and connect the derivation to the broader chain rule pattern used in neural networks.
Let's establish our notation and the function we're differentiating.
Data and Parameters:

We observe $n$ pairs $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^T$ is a feature vector with $x_{i0} = 1$ for the intercept and $y_i \in \{0, 1\}$ is the binary label. The parameter vector is $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)^T$.
The Log-Likelihood:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
where the predicted probability for observation $i$ is:
$$\pi_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \quad z_i = \boldsymbol{\theta}^T \mathbf{x}_i = \sum_{j=0}^{p} \theta_j x_{ij}$$
What We Seek:
The gradient is the vector of partial derivatives:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\partial \ell}{\partial \theta_0} \\ \frac{\partial \ell}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell}{\partial \theta_p} \end{pmatrix}$$
Each component tells us how the log-likelihood changes when we adjust the corresponding parameter.
The key insight is that the log-likelihood depends on $\boldsymbol{\theta}$ through a chain: $\boldsymbol{\theta} \to z_i \to \pi_i \to \ell$. Differentiating requires applying the chain rule through each step. This structure is exactly what backpropagation exploits in neural networks.
Let's first derive the gradient contribution from a single observation. This builds intuition that we'll later extend to the full dataset.
Single Observation Log-Likelihood:
$$\ell_i(\boldsymbol{\theta}) = y_i \log \pi_i + (1-y_i) \log(1-\pi_i)$$
Step 1: Apply the Chain Rule
$$\frac{\partial \ell_i}{\partial \theta_j} = \frac{\partial \ell_i}{\partial \pi_i} \cdot \frac{\partial \pi_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial \theta_j}$$
Let's compute each factor.
Step 2: Derivative of Log-Likelihood w.r.t. Probability
$$\frac{\partial \ell_i}{\partial \pi_i} = \frac{y_i}{\pi_i} - \frac{1-y_i}{1-\pi_i}$$
Combining over a common denominator:
$$\frac{\partial \ell_i}{\partial \pi_i} = \frac{y_i(1-\pi_i) - (1-y_i)\pi_i}{\pi_i(1-\pi_i)} = \frac{y_i - \pi_i}{\pi_i(1-\pi_i)}$$
Step 3: Derivative of Sigmoid
This is the famous sigmoid derivative:
$$\frac{\partial \pi_i}{\partial z_i} = \frac{\partial}{\partial z_i} \sigma(z_i) = \sigma(z_i)(1-\sigma(z_i)) = \pi_i(1-\pi_i)$$
The derivative of the sigmoid is expressed entirely in terms of the sigmoid itself: $\sigma'(z) = \sigma(z)(1-\sigma(z))$. This elegant property means we never need to recompute exponentials during backpropagation—we reuse the forward pass values. This efficiency carries over to neural networks.
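As a quick sanity check (not part of the derivation), the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be verified numerically; the sketch below uses a plain sigmoid and a few arbitrary test points.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Verify sigma'(z) = sigma(z)(1 - sigma(z)) at a few arbitrary points via central differences
z_vals = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(z_vals + eps) - sigmoid(z_vals - eps)) / (2 * eps)
analytic = sigmoid(z_vals) * (1 - sigmoid(z_vals))
print(np.allclose(numeric, analytic, atol=1e-8))  # expected: True
```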
Step 4: Derivative of Linear Combination
$$\frac{\partial z_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_{k=0}^{p} \theta_k x_{ik} = x_{ij}$$
Step 5: Combine via Chain Rule
$$\frac{\partial \ell_i}{\partial \theta_j} = \frac{y_i - \pi_i}{\pi_i(1-\pi_i)} \cdot \pi_i(1-\pi_i) \cdot x_{ij}$$
The $\pi_i(1-\pi_i)$ terms cancel beautifully:
$$\frac{\partial \ell_i}{\partial \theta_j} = (y_i - \pi_i) x_{ij}$$
Interpretation:
The gradient contribution from observation $i$ for parameter $j$ is the residual times the feature value: $(y_i - \pi_i)\, x_{ij}$.
If $y_i = 1$ and $\pi_i < 1$, the residual is positive, pushing $\theta_j$ in the direction of $x_{ij}$ to increase $\pi_i$. If $y_i = 0$ and $\pi_i > 0$, the residual is negative, pushing $\theta_j$ opposite to $x_{ij}$ to decrease $\pi_i$.
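For concreteness, here is a tiny numerical illustration of that push, using made-up values for a single observation (the numbers are hypothetical, chosen only to show the sign and size of the contribution):

```python
# One observation: true label 1, predicted probability 0.3, feature value 2.0 (made-up numbers)
y_i, pi_i, x_ij = 1.0, 0.3, 2.0
contribution = (y_i - pi_i) * x_ij   # residual times feature value
print(contribution)  # 0.7 * 2.0 = 1.4 -> positive: nudges theta_j along x_ij, raising pi_i
```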
The full log-likelihood is the sum over all observations, so the gradient is the sum of individual contributions.
Element-wise Form:
$$\frac{\partial \ell}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \ell_i}{\partial \theta_j} = \sum_{i=1}^{n} (y_i - \pi_i) x_{ij}$$
This holds for each $j = 0, 1, \ldots, p$.
The Full Gradient Vector:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \begin{pmatrix} \sum_{i=1}^{n} (y_i - \pi_i) x_{i0} \\ \sum_{i=1}^{n} (y_i - \pi_i) x_{i1} \\ \vdots \\ \sum_{i=1}^{n} (y_i - \pi_i) x_{ip} \end{pmatrix}$$
Special Case: The Intercept Gradient
For $j = 0$ (intercept), where $x_{i0} = 1$ for all $i$:
$$\frac{\partial \ell}{\partial \theta_0} = \sum_{i=1}^{n} (y_i - \pi_i)$$
This is simply the sum of residuals. At the MLE, this equals zero:
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{\pi}_i$$
The total number of positive labels equals the sum of predicted probabilities—a fundamental calibration property.
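This calibration property is easy to check empirically. The sketch below fits the MLE with SciPy's BFGS on simulated data (the data-generating parameters are made up) and compares the label total with the sum of fitted probabilities; at convergence the two should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
theta_true = np.array([0.3, 1.0, -0.7])                      # made-up true parameters
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ theta_true)))).astype(float)

def neg_log_lik(theta):
    z = X @ theta
    return np.sum(np.logaddexp(0, z) - y * z)    # -l(theta), computed stably

def neg_grad(theta):
    pi = 1 / (1 + np.exp(-(X @ theta)))
    return X.T @ (pi - y)                        # gradient of the loss: X^T (pi - y)

theta_hat = minimize(neg_log_lik, np.zeros(3), jac=neg_grad, method="BFGS").x
pi_hat = 1 / (1 + np.exp(-(X @ theta_hat)))
print(y.sum(), pi_hat.sum())   # the two sums should match closely at the MLE
```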
We derived the gradient of the log-likelihood, which we want to maximize. In optimization libraries and neural network frameworks, we typically minimize a loss function. The loss is the negative log-likelihood: $\mathcal{J} = -\ell$. Its gradient is $\nabla \mathcal{J} = -\nabla \ell = \sum_i (\pi_i - y_i) \mathbf{x}_i$. Watch the sign!
For efficient computation (and elegant notation), we express the gradient in matrix form.
Define the Objects:

Let $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ be the design matrix whose $i$-th row is $\mathbf{x}_i^T$ (with an intercept column of ones), $\mathbf{y} \in \{0,1\}^n$ the vector of labels, and $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta}) \in (0,1)^n$ the vector of predicted probabilities, with the sigmoid applied element-wise.
The Gradient in Matrix Form:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$$
Verification:
The $j$-th component of $\mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$ is:
$$[\mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})]_j = \sum_{i=1}^{n} X_{ij} (y_i - \pi_i) = \sum_{i=1}^{n} (y_i - \pi_i) x_{ij}$$
which matches our element-wise derivation. ✓
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def gradient_logistic(theta, X, y):
    """
    Compute gradient of log-likelihood for logistic regression.

    ∇ℓ(θ) = X^T (y - π)

    Parameters
    ----------
    theta : np.ndarray, shape (p+1,)
        Parameter vector
    X : np.ndarray, shape (n, p+1)
        Design matrix with intercept column
    y : np.ndarray, shape (n,)
        Binary response vector

    Returns
    -------
    np.ndarray, shape (p+1,)
        Gradient vector
    """
    z = X @ theta              # Linear predictor: z = Xθ
    pi = sigmoid(z)            # Probabilities: π = σ(z)
    residual = y - pi          # Residual: y - π
    gradient = X.T @ residual  # Gradient: X^T (y - π)
    return gradient

def gradient_loss(theta, X, y):
    """
    Gradient of the LOSS (negative log-likelihood).
    Use this for minimization algorithms.

    ∇J(θ) = -∇ℓ(θ) = X^T (π - y)
    """
    return -gradient_logistic(theta, X, y)

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Compute gradient at a test point
theta_test = np.zeros(p)
grad = gradient_logistic(theta_test, X, y)

print("Gradient at θ = 0:")
print(f"  ∇ℓ = {grad}")
print(f"  ||∇ℓ|| = {np.linalg.norm(grad):.4f}")

# At the optimum, gradient should be ~0
from scipy.optimize import minimize
result = minimize(
    lambda t: -np.sum(y * np.log(sigmoid(X @ t) + 1e-15)
                      + (1 - y) * np.log(1 - sigmoid(X @ t) + 1e-15)),
    x0=np.zeros(p),
    jac=lambda t: gradient_loss(t, X, y),
    method='BFGS'
)
grad_at_opt = gradient_logistic(result.x, X, y)
print(f"\nGradient at MLE θ̂:")
print(f"  ∇ℓ = {grad_at_opt}")
print(f"  ||∇ℓ|| = {np.linalg.norm(grad_at_opt):.6f} (≈0 at optimum)")
```

The matrix form $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ computes the gradient in just three operations: (1) a forward pass to get $\boldsymbol{\pi}$, (2) the residual computation $\mathbf{y} - \boldsymbol{\pi}$, and (3) one matrix-vector multiplication. For $n$ observations and $p$ features, this is $O(np)$—linear in both dimensions.
The gradient $\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ has a beautiful structure worth understanding deeply.
Residual-Feature Product:
Each gradient component is a weighted sum of feature values, where the weights are residuals:
$$\frac{\partial \ell}{\partial \theta_j} = \langle \mathbf{x}_{\cdot j}, \mathbf{y} - \boldsymbol{\pi} \rangle$$
This is the dot product between the $j$-th feature column and the residual vector.
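This reading can be checked in one line on random data; the sketch below uses arbitrary dimensions and an arbitrary parameter vector purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)
pi = 1 / (1 + np.exp(-(X @ rng.normal(size=p))))   # probabilities at some arbitrary theta

j = 1
grad_j = X[:, j] @ (y - pi)              # dot product: feature column j with the residual
full_grad = X.T @ (y - pi)               # full gradient in matrix form
print(np.isclose(grad_j, full_grad[j]))  # expected: True
```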
When Is the Gradient Large?
The gradient for parameter $j$ is large when the residuals are large (predictions are far from the labels) and when those residuals line up with the values of feature $j$. In other words, feature $j$ still carries predictive information the model has not yet used.
When Is the Gradient Zero?
The gradient for parameter $j$ is zero when feature $j$ is orthogonal to the residual vector. At the MLE, all features are orthogonal to residuals—we've extracted all predictable structure.
| Observation | Label $y_i$ | Prediction $\pi_i$ | Residual $y_i - \pi_i$ | Contribution |
|---|---|---|---|---|
| Correct positive, confident | 1 | 0.95 | +0.05 | Small positive |
| Correct positive, uncertain | 1 | 0.60 | +0.40 | Large positive |
| Correct negative, confident | 0 | 0.05 | -0.05 | Small negative |
| Correct negative, uncertain | 0 | 0.40 | -0.40 | Large negative |
| Misclassified positive | 1 | 0.10 | +0.90 | Very large positive |
| Misclassified negative | 0 | 0.90 | -0.90 | Very large negative |
Comparison with Linear Regression Gradient:
In linear regression with squared error loss:
$$\nabla_\theta \mathcal{J}_{\text{linear}} = -\frac{2}{n} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
The structure is identical! Both gradients are $\mathbf{X}^T \times \text{residual}$. The difference is how the prediction is computed: linear regression uses $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$ directly, while logistic regression passes the linear predictor through the sigmoid, $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$.
This unified perspective is the foundation of Generalized Linear Models (GLMs).
For any GLM with canonical link, the gradient takes the form $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$ where $\boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\theta})$ for link function $g$. Gaussian with identity link (linear regression) and Bernoulli with logit link (logistic regression) are special cases.
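To make the GLM claim concrete, here is a minimal sketch of a generic score function parameterized by the inverse link. The helper name glm_gradient and the toy data are illustrative, and for the Gaussian case the score matches the squared-error gradient above only up to a constant scale factor.

```python
import numpy as np

def glm_gradient(theta, X, y, inv_link):
    """Score of a canonical-link GLM: X^T (y - mu), where mu = inv_link(X theta)."""
    mu = inv_link(X @ theta)
    return X.T @ (y - mu)

identity = lambda z: z                    # Gaussian + identity link  (linear regression)
sigmoid = lambda z: 1 / (1 + np.exp(-z))  # Bernoulli + logit link    (logistic regression)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
theta = rng.normal(size=3)
y_cont = rng.normal(size=5)                       # continuous response for the Gaussian case
y_bin = rng.integers(0, 2, size=5).astype(float)  # binary response for the Bernoulli case

print(glm_gradient(theta, X, y_cont, identity))   # X^T (y - Xθ)
print(glm_gradient(theta, X, y_bin, sigmoid))     # X^T (y - σ(Xθ))
```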
When implementing gradients, bugs are common and can be subtle. A crucial debugging technique is gradient checking via finite differences.
Finite Difference Approximation:
The derivative of a function at a point can be approximated by:
$$\frac{\partial f}{\partial \theta_j} \approx \frac{f(\boldsymbol{\theta} + \epsilon \mathbf{e}_j) - f(\boldsymbol{\theta} - \epsilon \mathbf{e}_j)}{2\epsilon}$$
where $\mathbf{e}_j$ is the $j$-th standard basis vector and $\epsilon$ is a small perturbation (typically $10^{-5}$ to $10^{-7}$).
This central difference approximation has error $O(\epsilon^2)$, much better than the one-sided $O(\epsilon)$.
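The two error orders are easy to see numerically. Here is a small sketch on a function with a known derivative (sine, whose derivative is cosine); the test point is arbitrary.

```python
import numpy as np

x, true_deriv = 1.0, np.cos(1.0)   # d/dx sin(x) = cos(x)

for eps in [1e-2, 1e-3, 1e-4]:
    forward = (np.sin(x + eps) - np.sin(x)) / eps              # one-sided: error O(eps)
    central = (np.sin(x + eps) - np.sin(x - eps)) / (2 * eps)  # central:   error O(eps^2)
    print(f"eps={eps:.0e}  forward error={abs(forward - true_deriv):.2e}  "
          f"central error={abs(central - true_deriv):.2e}")
```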
Why This Matters:
Analytical gradients are fast (computed in one pass). Numerical gradients are slow ($O(p)$ function evaluations) but easy to get right. By comparing them, we verify our analytical implementation is correct.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def log_likelihood(theta, X, y):
    """Compute log-likelihood."""
    z = X @ theta
    # Stable computation
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

def gradient_analytical(theta, X, y):
    """Analytical gradient: X^T (y - π)"""
    z = X @ theta
    pi = sigmoid(z)
    return X.T @ (y - pi)

def gradient_numerical(theta, X, y, epsilon=1e-5):
    """
    Numerical gradient via central differences.

    For each parameter j:
        ∂ℓ/∂θⱼ ≈ [ℓ(θ + ε eⱼ) - ℓ(θ - ε eⱼ)] / (2ε)
    """
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        # Add epsilon to parameter j
        theta_plus = theta.copy()
        theta_plus[j] += epsilon
        # Subtract epsilon from parameter j
        theta_minus = theta.copy()
        theta_minus[j] -= epsilon
        # Central difference
        grad[j] = (log_likelihood(theta_plus, X, y)
                   - log_likelihood(theta_minus, X, y)) / (2 * epsilon)
    return grad

def check_gradient(theta, X, y, epsilon=1e-5, tolerance=1e-5):
    """
    Compare analytical and numerical gradients.
    Returns True if they match within tolerance.
    """
    grad_ana = gradient_analytical(theta, X, y)
    grad_num = gradient_numerical(theta, X, y, epsilon)

    # Relative error (handling near-zero components)
    diff = np.abs(grad_ana - grad_num)
    denom = np.maximum(np.abs(grad_ana) + np.abs(grad_num), 1e-8)
    relative_error = diff / denom
    max_error = np.max(relative_error)

    print("Gradient Check:")
    print(f"  Analytical: {grad_ana}")
    print(f"  Numerical:  {grad_num}")
    print(f"  Max relative error: {max_error:.2e}")
    print(f"  Status: {'PASS ✓' if max_error < tolerance else 'FAIL ✗'}")
    return max_error < tolerance

# Test gradient implementation
np.random.seed(42)
n, p = 50, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p)
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Check at multiple random points
print("=== Gradient Verification ===\n")
for i in range(3):
    theta_test = np.random.randn(p)
    print(f"Test point {i+1}:")
    check_gradient(theta_test, X, y)
    print()
```

Gradient checking should be part of your workflow whenever implementing new loss functions or networks. A bug in the gradient will cause optimization to converge to wrong solutions or fail entirely. The few seconds spent on verification can save hours of debugging mysterious training failures.
Logistic regression is a single-layer neural network with sigmoid activation and cross-entropy loss. The gradient derivation we just performed is exactly backpropagation for this simple case.
The Computation Graph:
$$\boldsymbol{\theta} \xrightarrow{\text{linear}} \mathbf{z} = \mathbf{X}\boldsymbol{\theta} \xrightarrow{\sigma} \boldsymbol{\pi} \xrightarrow{-\log} \ell$$
Forward Pass:

Compute $\mathbf{z} = \mathbf{X}\boldsymbol{\theta}$, then $\boldsymbol{\pi} = \sigma(\mathbf{z})$, then the loss $\mathcal{J} = -\ell$. Each intermediate quantity is cached for reuse.

Backward Pass (Backpropagation):

Propagate derivatives through the graph in reverse: start from $\partial \ell / \partial \pi_i = (y_i - \pi_i)/[\pi_i(1-\pi_i)]$, multiply by $\partial \pi_i / \partial z_i = \pi_i(1-\pi_i)$ to get $\partial \ell / \partial z_i = y_i - \pi_i$, then multiply by $\partial z_i / \partial \theta_j = x_{ij}$ and sum over observations to obtain $\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$.
Why the Sigmoid + Cross-Entropy Combination Is Special:
The remarkable cancellation $\frac{y - \pi}{\pi(1-\pi)} \cdot \pi(1-\pi) = y - \pi$ is not a coincidence. It arises because the logit is the canonical link of the Bernoulli distribution: the logarithm in the cross-entropy loss exactly undoes the exponential in the sigmoid, so the derivative of the loss with respect to the pre-activation $z$ collapses to the simple residual.
This is why modern neural networks use softmax (multi-class sigmoid) with cross-entropy loss: the gradients are numerically stable and computationally efficient.
When you train a neural network with sigmoid outputs and cross-entropy loss, the gradient at the output layer is exactly $(\pi - y)$. Everything we derived here applies, with additional layers contributing their own chain rule factors. Understanding logistic regression's gradient is understanding the output layer of classification networks.
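The same residual structure can be checked for the multi-class case. The sketch below compares the analytical softmax-plus-cross-entropy gradient at the logits, $\boldsymbol{\pi} - \mathbf{y}$, against central differences on a made-up three-class example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])     # made-up logits for one example
y = np.array([0.0, 1.0, 0.0])      # one-hot true label

def cross_entropy(z):
    return -np.sum(y * np.log(softmax(z)))

analytic = softmax(z) - y          # gradient of the loss w.r.t. the logits

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    e_k = np.zeros_like(z)
    e_k[k] = eps
    numeric[k] = (cross_entropy(z + e_k) - cross_entropy(z - e_k)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```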
We have derived the gradient of the log-likelihood for logistic regression from multiple perspectives. This gradient is the engine that drives optimization.
What's Next:
We now have the log-likelihood, its concavity proof, and the gradient. Can we solve the score equations $\nabla_\theta \ell = \mathbf{0}$ analytically? The next page explores why no closed-form solution exists for logistic regression—the nonlinearity of the sigmoid prevents the algebraic simplifications that work for linear regression. This motivates the iterative optimization methods we'll develop afterward.
You can now derive the logistic regression gradient from scratch, implement it efficiently in matrix form, verify implementations via gradient checking, and connect the derivation to neural network backpropagation. Next, we'll understand why analytic solutions don't exist.