Every optimization algorithm, from basic gradient descent to sophisticated second-order methods, relies on one fundamental object: the gradient. The gradient tells us the direction of steepest increase of a function and, by negation, the direction of steepest descent.
In the previous pages, we established the log-likelihood function for logistic regression and proved its concavity. Now we derive the gradient explicitly—the mathematical machinery that drives parameter updates in training.
This derivation is not merely academic. When you implement logistic regression from scratch, when you debug mysterious convergence issues, when you design custom loss functions for neural networks, you will return to these calculations. The gradient derivation for logistic regression is also the prototype for backpropagation in deep learning—understanding it here illuminates the entire field.
We'll derive the gradient multiple ways: element-by-element, in vector form, and using the chain rule systematically. By the end, you'll have both the formulas for implementation and the intuition for what they mean.
By the end of this page, you will be able to derive the gradient of the logistic regression log-likelihood, express it in compact matrix form for efficient computation, understand the intuition behind the gradient structure, and connect the derivation to the broader chain rule pattern used in neural networks.
Let's establish our notation and the function we're differentiating.
Data and Parameters:

We observe $n$ pairs $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^T$ is a feature vector with $x_{i0} = 1$ for the intercept and $y_i \in \{0, 1\}$ is the binary label. The parameter vector is $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)^T$.
The Log-Likelihood:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
where the predicted probability for observation $i$ is:
$$\pi_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \quad z_i = \boldsymbol{\theta}^T \mathbf{x}_i = \sum_{j=0}^{p} \theta_j x_{ij}$$
What We Seek:
The gradient is the vector of partial derivatives:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\partial \ell}{\partial \theta_0} \\ \frac{\partial \ell}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell}{\partial \theta_p} \end{pmatrix}$$
Each component tells us how the log-likelihood changes when we adjust the corresponding parameter.
The key insight is that the log-likelihood depends on $\boldsymbol{\theta}$ through a chain: $\boldsymbol{\theta} \to z_i \to \pi_i \to \ell$. Differentiating requires applying the chain rule through each step. This structure is exactly what backpropagation exploits in neural networks.
Let's first derive the gradient contribution from a single observation. This builds intuition that we'll later extend to the full dataset.
Single Observation Log-Likelihood:
$$\ell_i(\boldsymbol{\theta}) = y_i \log \pi_i + (1-y_i) \log(1-\pi_i)$$
Step 1: Apply the Chain Rule
$$\frac{\partial \ell_i}{\partial \theta_j} = \frac{\partial \ell_i}{\partial \pi_i} \cdot \frac{\partial \pi_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial \theta_j}$$
Let's compute each factor.
Step 2: Derivative of Log-Likelihood w.r.t. Probability
$$\frac{\partial \ell_i}{\partial \pi_i} = \frac{y_i}{\pi_i} - \frac{1-y_i}{1-\pi_i}$$
Combining over a common denominator:
$$\frac{\partial \ell_i}{\partial \pi_i} = \frac{y_i(1-\pi_i) - (1-y_i)\pi_i}{\pi_i(1-\pi_i)} = \frac{y_i - \pi_i}{\pi_i(1-\pi_i)}$$
Step 3: Derivative of Sigmoid
This is the famous sigmoid derivative:
$$\frac{\partial \pi_i}{\partial z_i} = \frac{\partial}{\partial z_i} \sigma(z_i) = \sigma(z_i)(1-\sigma(z_i)) = \pi_i(1-\pi_i)$$
The derivative of the sigmoid is expressed entirely in terms of the sigmoid itself: $\sigma'(z) = \sigma(z)(1-\sigma(z))$. This elegant property means we never need to recompute exponentials during backpropagation—we reuse the forward pass values. This efficiency carries over to neural networks.
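As a quick sanity check (not part of the derivation), the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ can be verified numerically; the sketch below uses a plain sigmoid and a few arbitrary test points.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Verify sigma'(z) = sigma(z)(1 - sigma(z)) at a few arbitrary points via central differences
z_vals = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(z_vals + eps) - sigmoid(z_vals - eps)) / (2 * eps)
analytic = sigmoid(z_vals) * (1 - sigmoid(z_vals))
print(np.allclose(numeric, analytic, atol=1e-8))  # expected: True
```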
Step 4: Derivative of Linear Combination
$$\frac{\partial z_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_{k=0}^{p} \theta_k x_{ik} = x_{ij}$$
Step 5: Combine via Chain Rule
$$\frac{\partial \ell_i}{\partial \theta_j} = \frac{y_i - \pi_i}{\pi_i(1-\pi_i)} \cdot \pi_i(1-\pi_i) \cdot x_{ij}$$
The $\pi_i(1-\pi_i)$ terms cancel beautifully:
$$\frac{\partial \ell_i}{\partial \theta_j} = (y_i - \pi_i) x_{ij}$$
Interpretation:
The gradient contribution from observation $i$ for parameter $j$ is the residual times the feature value: $(y_i - \pi_i)\, x_{ij}$.
If $y_i = 1$ and $\pi_i < 1$, the residual is positive, pushing $\theta_j$ in the direction of $x_{ij}$ to increase $\pi_i$. If $y_i = 0$ and $\pi_i > 0$, the residual is negative, pushing $\theta_j$ opposite to $x_{ij}$ to decrease $\pi_i$.
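For concreteness, here is a tiny numerical illustration of that push, using made-up values for a single observation (the numbers are hypothetical, chosen only to show the sign and size of the contribution):

```python
# One observation: true label 1, predicted probability 0.3, feature value 2.0 (made-up numbers)
y_i, pi_i, x_ij = 1.0, 0.3, 2.0
contribution = (y_i - pi_i) * x_ij   # residual times feature value
print(contribution)  # 0.7 * 2.0 = 1.4 -> positive: nudges theta_j along x_ij, raising pi_i
```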
The full log-likelihood is the sum over all observations, so the gradient is the sum of individual contributions.
Element-wise Form:
$$\frac{\partial \ell}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \ell_i}{\partial \theta_j} = \sum_{i=1}^{n} (y_i - \pi_i) x_{ij}$$
This holds for each $j = 0, 1, \ldots, p$.
The Full Gradient Vector:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \begin{pmatrix} \sum_{i=1}^{n} (y_i - \pi_i) x_{i0} \\ \sum_{i=1}^{n} (y_i - \pi_i) x_{i1} \\ \vdots \\ \sum_{i=1}^{n} (y_i - \pi_i) x_{ip} \end{pmatrix}$$
Special Case: The Intercept Gradient
For $j = 0$ (intercept), where $x_{i0} = 1$ for all $i$:
$$\frac{\partial \ell}{\partial \theta_0} = \sum_{i=1}^{n} (y_i - \pi_i)$$
This is simply the sum of residuals. At the MLE, this equals zero:
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{\pi}_i$$
The total number of positive labels equals the sum of predicted probabilities—a fundamental calibration property.
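This calibration property is easy to check empirically. The sketch below fits the MLE with SciPy's BFGS on simulated data (the data-generating parameters are made up) and compares the label total with the sum of fitted probabilities; at convergence the two should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
theta_true = np.array([0.3, 1.0, -0.7])                      # made-up true parameters
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ theta_true)))).astype(float)

def neg_log_lik(theta):
    z = X @ theta
    return np.sum(np.logaddexp(0, z) - y * z)    # -l(theta), computed stably

def neg_grad(theta):
    pi = 1 / (1 + np.exp(-(X @ theta)))
    return X.T @ (pi - y)                        # gradient of the loss: X^T (pi - y)

theta_hat = minimize(neg_log_lik, np.zeros(3), jac=neg_grad, method="BFGS").x
pi_hat = 1 / (1 + np.exp(-(X @ theta_hat)))
print(y.sum(), pi_hat.sum())   # the two sums should match closely at the MLE
```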
We derived the gradient of the log-likelihood, which we want to maximize. In optimization libraries and neural network frameworks, we typically minimize a loss function. The loss is the negative log-likelihood: $\mathcal{J} = -\ell$. Its gradient is $\nabla \mathcal{J} = -\nabla \ell = \sum_i (\pi_i - y_i) \mathbf{x}_i$. Watch the sign!
For efficient computation (and elegant notation), we express the gradient in matrix form.
Define the Objects:

Let $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ be the design matrix whose $i$-th row is $\mathbf{x}_i^T$ (with an intercept column of ones), $\mathbf{y} \in \{0,1\}^n$ the vector of labels, and $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta}) \in (0,1)^n$ the vector of predicted probabilities, with the sigmoid applied element-wise.
The Gradient in Matrix Form:
$$\nabla_\theta \ell(\boldsymbol{\theta}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$$
Verification:
The $j$-th component of $\mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$ is:
$$[\mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})]_j = \sum_{i=1}^{n} X_{ij} (y_i - \pi_i) = \sum_{i=1}^{n} (y_i - \pi_i) x_{ij}$$
which matches our element-wise derivation. ✓
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def gradient_logistic(theta, X, y):
    """
    Compute gradient of log-likelihood for logistic regression.

    ∇ℓ(θ) = X^T (y - π)

    Parameters
    ----------
    theta : np.ndarray, shape (p+1,)
        Parameter vector
    X : np.ndarray, shape (n, p+1)
        Design matrix with intercept column
    y : np.ndarray, shape (n,)
        Binary response vector

    Returns
    -------
    np.ndarray, shape (p+1,)
        Gradient vector
    """
    z = X @ theta              # Linear predictor: z = Xθ
    pi = sigmoid(z)            # Probabilities: π = σ(z)
    residual = y - pi          # Residual: y - π
    gradient = X.T @ residual  # Gradient: X^T (y - π)
    return gradient

def gradient_loss(theta, X, y):
    """
    Gradient of the LOSS (negative log-likelihood).
    Use this for minimization algorithms.

    ∇J(θ) = -∇ℓ(θ) = X^T (π - y)
    """
    return -gradient_logistic(theta, X, y)

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Compute gradient at a test point
theta_test = np.zeros(p)
grad = gradient_logistic(theta_test, X, y)

print("Gradient at θ = 0:")
print(f"  ∇ℓ = {grad}")
print(f"  ||∇ℓ|| = {np.linalg.norm(grad):.4f}")

# At the optimum, gradient should be ~0
from scipy.optimize import minimize
result = minimize(
    lambda t: -np.sum(y * np.log(sigmoid(X @ t) + 1e-15)
                      + (1 - y) * np.log(1 - sigmoid(X @ t) + 1e-15)),
    x0=np.zeros(p),
    jac=lambda t: gradient_loss(t, X, y),
    method='BFGS'
)
grad_at_opt = gradient_logistic(result.x, X, y)
print(f"\nGradient at MLE θ̂:")
print(f"  ∇ℓ = {grad_at_opt}")
print(f"  ||∇ℓ|| = {np.linalg.norm(grad_at_opt):.6f} (≈0 at optimum)")
```

The matrix form $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ computes the gradient in just three operations: (1) a forward pass to get $\boldsymbol{\pi}$, (2) the residual computation $\mathbf{y} - \boldsymbol{\pi}$, and (3) one matrix-vector multiplication. For $n$ observations and $p$ features, this is $O(np)$—linear in both dimensions.
The gradient $\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ has a beautiful structure worth understanding deeply.
Residual-Feature Product:
Each gradient component is a weighted sum of feature values, where the weights are residuals:
$$\frac{\partial \ell}{\partial \theta_j} = \langle \mathbf{x}_{\cdot j}, \mathbf{y} - \boldsymbol{\pi} \rangle$$
This is the dot product between the $j$-th feature column and the residual vector.
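This reading can be checked in one line on random data; the sketch below uses arbitrary dimensions and an arbitrary parameter vector purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)
pi = 1 / (1 + np.exp(-(X @ rng.normal(size=p))))   # probabilities at some arbitrary theta

j = 1
grad_j = X[:, j] @ (y - pi)              # dot product: feature column j with the residual
full_grad = X.T @ (y - pi)               # full gradient in matrix form
print(np.isclose(grad_j, full_grad[j]))  # expected: True
```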
When Is the Gradient Large?
The gradient for parameter $j$ is large when the residuals are large (predictions are far from the labels) and when those residuals line up with the values of feature $j$. In other words, feature $j$ still carries predictive information the model has not yet used.
When Is the Gradient Zero?
The gradient for parameter $j$ is zero when feature $j$ is orthogonal to the residual vector. At the MLE, all features are orthogonal to residuals—we've extracted all predictable structure.
| Observation | Label $y_i$ | Prediction $\pi_i$ | Residual $y_i - \pi_i$ | Contribution |
|---|---|---|---|---|
| Correct positive, confident | 1 | 0.95 | +0.05 | Small positive |
| Correct positive, uncertain | 1 | 0.60 | +0.40 | Large positive |
| Correct negative, confident | 0 | 0.05 | -0.05 | Small negative |
| Correct negative, uncertain | 0 | 0.40 | -0.40 | Large negative |
| Misclassified positive | 1 | 0.10 | +0.90 | Very large positive |
| Misclassified negative | 0 | 0.90 | -0.90 | Very large negative |
Comparison with Linear Regression Gradient:
In linear regression with squared error loss:
$$\nabla_\theta \mathcal{J}_{\text{linear}} = -\frac{2}{n} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
The structure is identical! Both gradients are $\mathbf{X}^T \times \text{residual}$. The difference is how the prediction is computed: linear regression uses $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$ directly, while logistic regression passes the linear predictor through the sigmoid, $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$.
This unified perspective is the foundation of Generalized Linear Models (GLMs).
For any GLM with canonical link, the gradient takes the form $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$ where $\boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\theta})$ for link function $g$. Gaussian with identity link (linear regression) and Bernoulli with logit link (logistic regression) are special cases.
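To make the GLM claim concrete, here is a minimal sketch of a generic score function parameterized by the inverse link. The helper name glm_gradient and the toy data are illustrative, and for the Gaussian case the score matches the squared-error gradient above only up to a constant scale factor.

```python
import numpy as np

def glm_gradient(theta, X, y, inv_link):
    """Score of a canonical-link GLM: X^T (y - mu), where mu = inv_link(X theta)."""
    mu = inv_link(X @ theta)
    return X.T @ (y - mu)

identity = lambda z: z                    # Gaussian + identity link  (linear regression)
sigmoid = lambda z: 1 / (1 + np.exp(-z))  # Bernoulli + logit link    (logistic regression)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
theta = rng.normal(size=3)
y_cont = rng.normal(size=5)                       # continuous response for the Gaussian case
y_bin = rng.integers(0, 2, size=5).astype(float)  # binary response for the Bernoulli case

print(glm_gradient(theta, X, y_cont, identity))   # X^T (y - Xθ)
print(glm_gradient(theta, X, y_bin, sigmoid))     # X^T (y - σ(Xθ))
```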
When implementing gradients, bugs are common and can be subtle. A crucial debugging technique is gradient checking via finite differences.
Finite Difference Approximation:
The derivative of a function at a point can be approximated by:
$$\frac{\partial f}{\partial \theta_j} \approx \frac{f(\boldsymbol{\theta} + \epsilon \mathbf{e}_j) - f(\boldsymbol{\theta} - \epsilon \mathbf{e}_j)}{2\epsilon}$$
where $\mathbf{e}_j$ is the $j$-th standard basis vector and $\epsilon$ is a small perturbation (typically $10^{-5}$ to $10^{-7}$).
This central difference approximation has error $O(\epsilon^2)$, much better than the one-sided $O(\epsilon)$.
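The two error orders are easy to see numerically. Here is a small sketch on a function with a known derivative (sine, whose derivative is cosine); the test point is arbitrary.

```python
import numpy as np

x, true_deriv = 1.0, np.cos(1.0)   # d/dx sin(x) = cos(x)

for eps in [1e-2, 1e-3, 1e-4]:
    forward = (np.sin(x + eps) - np.sin(x)) / eps              # one-sided: error O(eps)
    central = (np.sin(x + eps) - np.sin(x - eps)) / (2 * eps)  # central:   error O(eps^2)
    print(f"eps={eps:.0e}  forward error={abs(forward - true_deriv):.2e}  "
          f"central error={abs(central - true_deriv):.2e}")
```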
Why This Matters:
Analytical gradients are fast (computed in one pass). Numerical gradients are slow ($O(p)$ function evaluations) but easy to get right. By comparing them, we verify our analytical implementation is correct.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def log_likelihood(theta, X, y):
    """Compute log-likelihood."""
    z = X @ theta
    # Stable computation
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

def gradient_analytical(theta, X, y):
    """Analytical gradient: X^T (y - π)"""
    z = X @ theta
    pi = sigmoid(z)
    return X.T @ (y - pi)

def gradient_numerical(theta, X, y, epsilon=1e-5):
    """
    Numerical gradient via central differences.

    For each parameter j:
        ∂ℓ/∂θⱼ ≈ [ℓ(θ + ε eⱼ) - ℓ(θ - ε eⱼ)] / (2ε)
    """
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        # Add epsilon to parameter j
        theta_plus = theta.copy()
        theta_plus[j] += epsilon
        # Subtract epsilon from parameter j
        theta_minus = theta.copy()
        theta_minus[j] -= epsilon
        # Central difference
        grad[j] = (log_likelihood(theta_plus, X, y)
                   - log_likelihood(theta_minus, X, y)) / (2 * epsilon)
    return grad

def check_gradient(theta, X, y, epsilon=1e-5, tolerance=1e-5):
    """
    Compare analytical and numerical gradients.
    Returns True if they match within tolerance.
    """
    grad_ana = gradient_analytical(theta, X, y)
    grad_num = gradient_numerical(theta, X, y, epsilon)

    # Relative error (handling near-zero components)
    diff = np.abs(grad_ana - grad_num)
    denom = np.maximum(np.abs(grad_ana) + np.abs(grad_num), 1e-8)
    relative_error = diff / denom
    max_error = np.max(relative_error)

    print("Gradient Check:")
    print(f"  Analytical: {grad_ana}")
    print(f"  Numerical:  {grad_num}")
    print(f"  Max relative error: {max_error:.2e}")
    print(f"  Status: {'PASS ✓' if max_error < tolerance else 'FAIL ✗'}")
    return max_error < tolerance

# Test gradient implementation
np.random.seed(42)
n, p = 50, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p)
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Check at multiple random points
print("=== Gradient Verification ===\n")
for i in range(3):
    theta_test = np.random.randn(p)
    print(f"Test point {i+1}:")
    check_gradient(theta_test, X, y)
    print()
```

Gradient checking should be part of your workflow whenever implementing new loss functions or networks. A bug in the gradient will cause optimization to converge to wrong solutions or fail entirely. The few seconds spent on verification can save hours of debugging mysterious training failures.
Logistic regression is a single-layer neural network with sigmoid activation and cross-entropy loss. The gradient derivation we just performed is exactly backpropagation for this simple case.
The Computation Graph:
$$\boldsymbol{\theta} \xrightarrow{\text{linear}} \mathbf{z} = \mathbf{X}\boldsymbol{\theta} \xrightarrow{\sigma} \boldsymbol{\pi} \xrightarrow{-\log} \ell$$
Forward Pass:

Compute $\mathbf{z} = \mathbf{X}\boldsymbol{\theta}$, then $\boldsymbol{\pi} = \sigma(\mathbf{z})$, then the loss $\mathcal{J} = -\ell$. Each intermediate quantity is cached for reuse.

Backward Pass (Backpropagation):

Propagate derivatives through the graph in reverse: start from $\partial \ell / \partial \pi_i = (y_i - \pi_i)/[\pi_i(1-\pi_i)]$, multiply by $\partial \pi_i / \partial z_i = \pi_i(1-\pi_i)$ to get $\partial \ell / \partial z_i = y_i - \pi_i$, then multiply by $\partial z_i / \partial \theta_j = x_{ij}$ and sum over observations to obtain $\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$.
Why the Sigmoid + Cross-Entropy Combination Is Special:
The remarkable cancellation $\frac{y - \pi}{\pi(1-\pi)} \cdot \pi(1-\pi) = y - \pi$ is not a coincidence. It arises because the logit is the canonical link of the Bernoulli distribution: the logarithm in the cross-entropy loss exactly undoes the exponential in the sigmoid, so the derivative of the loss with respect to the pre-activation $z$ collapses to the simple residual.
This is why modern neural networks use softmax (multi-class sigmoid) with cross-entropy loss: the gradients are numerically stable and computationally efficient.
When you train a neural network with sigmoid outputs and cross-entropy loss, the gradient at the output layer is exactly $(\pi - y)$. Everything we derived here applies, with additional layers contributing their own chain rule factors. Understanding logistic regression's gradient is understanding the output layer of classification networks.
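The same residual structure can be checked for the multi-class case. The sketch below compares the analytical softmax-plus-cross-entropy gradient at the logits, $\boldsymbol{\pi} - \mathbf{y}$, against central differences on a made-up three-class example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])     # made-up logits for one example
y = np.array([0.0, 1.0, 0.0])      # one-hot true label

def cross_entropy(z):
    return -np.sum(y * np.log(softmax(z)))

analytic = softmax(z) - y          # gradient of the loss w.r.t. the logits

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    e_k = np.zeros_like(z)
    e_k[k] = eps
    numeric[k] = (cross_entropy(z + e_k) - cross_entropy(z - e_k)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```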
We have derived the gradient of the log-likelihood for logistic regression from multiple perspectives. This gradient is the engine that drives optimization.
What's Next:
We now have the log-likelihood, its concavity proof, and the gradient. Can we solve the score equations $\nabla_\theta \ell = \mathbf{0}$ analytically? The next page explores why no closed-form solution exists for logistic regression—the nonlinearity of the sigmoid prevents the algebraic simplifications that work for linear regression. This motivates the iterative optimization methods we'll develop afterward.
You can now derive the logistic regression gradient from scratch, implement it efficiently in matrix form, verify implementations via gradient checking, and connect the derivation to neural network backpropagation. Next, we'll understand why analytic solutions don't exist.