We have built the theoretical foundation for Maximum Likelihood Estimation in logistic regression.
Now we must put this theory into practice. How do we actually compute the MLE? What algorithms converge efficiently? What are the computational tradeoffs?
This page develops the major optimization approaches for logistic regression, from basic gradient descent to sophisticated second-order methods. Understanding these algorithms is essential not just for logistic regression, but for the broader field of machine learning where similar optimization challenges arise constantly.
By the end of this page, you will understand gradient descent and its variants, master Newton-Raphson and its connection to IRLS, appreciate the computational tradeoffs between methods, and be able to implement these algorithms from scratch with confidence.
The simplest optimization algorithm is gradient descent (or gradient ascent, when maximizing). The idea is elementary: move in the direction of the gradient.
The Algorithm:
Since we want to maximize the log-likelihood, we use gradient ascent:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \nabla_\theta \ell(\boldsymbol{\theta}^{(t)})$$
where $\eta > 0$ is the learning rate (step size).
Substituting the gradient we derived:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$
Why It Works:
The gradient $\nabla_\theta \ell$ points in the direction of steepest increase of $\ell$. By moving in this direction (with small enough $\eta$), we guarantee the log-likelihood increases:
$$\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$$
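To see why, take a first-order Taylor expansion at $\boldsymbol{\theta}^{(t)}$ (a one-line sketch of the standard argument):

$$\ell\big(\boldsymbol{\theta}^{(t)} + \eta\,\nabla_\theta \ell\big) \approx \ell(\boldsymbol{\theta}^{(t)}) + \eta\,\|\nabla_\theta \ell\|^2 \geq \ell(\boldsymbol{\theta}^{(t)})$$

so for sufficiently small $\eta$ the step cannot decrease the objective, with equality only when the gradient is zero.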
Since $\ell$ is concave, any stationary point ($\nabla_\theta \ell = 0$) is a global maximum. Hence gradient ascent converges to the MLE whenever the MLE exists.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def log_likelihood(theta, X, y):
    """Compute log-likelihood."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

def gradient(theta, X, y):
    """Compute gradient of log-likelihood."""
    z = X @ theta
    pi = sigmoid(z)
    return X.T @ (y - pi)

def gradient_ascent(X, y, eta=0.1, max_iter=1000, tol=1e-6):
    """
    Gradient ascent for logistic regression.

    Parameters
    ----------
    X : ndarray (n, p+1)
        Design matrix with intercept column
    y : ndarray (n,)
        Binary labels
    eta : float
        Learning rate
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance on gradient norm

    Returns
    -------
    theta : ndarray (p+1,)
        MLE parameters
    history : dict
        Training history (log-likelihood, gradient norms)
    """
    n, p = X.shape
    theta = np.zeros(p)  # Initialize at origin
    history = {'log_likelihood': [], 'grad_norm': []}

    for iteration in range(max_iter):
        # Compute gradient
        grad = gradient(theta, X, y)
        grad_norm = np.linalg.norm(grad)

        # Store history
        ll = log_likelihood(theta, X, y)
        history['log_likelihood'].append(ll)
        history['grad_norm'].append(grad_norm)

        # Check convergence
        if grad_norm < tol:
            print(f"Converged at iteration {iteration}")
            break

        # Update parameters
        theta = theta + eta * grad

    return theta, history

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_gd, history = gradient_ascent(X, y, eta=0.1, max_iter=500)
print(f"\nGradient Descent estimate: {theta_gd}")
print(f"True parameters: {theta_true}")
print(f"Final log-likelihood: {history['log_likelihood'][-1]:.4f}")
```

Choosing $\eta$ is tricky. Theory says $\eta < 2/L$ where $L$ is the Lipschitz constant of the gradient, but computing $L$ requires knowing properties of the Hessian. Line search (trying to find optimal $\eta$ per iteration) or adaptive methods (Adam, AdaGrad) are practical alternatives.
Newton-Raphson (or simply Newton's method) uses second-order information (the Hessian) to achieve faster convergence.
The Idea:
Instead of moving a fixed step in the gradient direction, Newton's method approximates the objective locally with a quadratic function and jumps directly to the maximum of that approximation.
Second-Order Taylor Expansion:
Near current point $\boldsymbol{\theta}^{(t)}$:
$$\ell(\boldsymbol{\theta}) \approx \ell(\boldsymbol{\theta}^{(t)}) + \nabla_\theta \ell^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^T \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})$$
where $\mathbf{H} = \nabla^2_\theta \ell = -\mathbf{X}^T\mathbf{W}\mathbf{X}$ is the Hessian.
Maximizing the Quadratic Approximation:
Setting the derivative of the approximation to zero:
$$\nabla_\theta \ell + \mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) = 0$$
Solving for $\boldsymbol{\theta}$:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \mathbf{H}^{-1} \nabla_\theta \ell$$
Since we're maximizing a concave function, $\mathbf{H}$ is negative definite, so $-\mathbf{H}^{-1}$ is positive definite and the update moves toward the maximum.
For Logistic Regression:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$
where $\mathbf{W}^{(t)} = \text{diag}(\pi_1^{(t)}(1-\pi_1^{(t)}), \ldots, \pi_n^{(t)}(1-\pi_n^{(t)}))$.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def newton_raphson(X, y, max_iter=25, tol=1e-8):
    """
    Newton-Raphson for logistic regression.

    Update: θ^(t+1) = θ^(t) + (X^T W X)^{-1} X^T (y - π)
    where W = diag(π_i(1-π_i))

    Returns
    -------
    theta : MLE parameters
    history : dict with convergence info
    """
    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': [], 'grad_norm': [], 'step_norm': []}

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Gradient
        residual = y - pi
        grad = X.T @ residual
        grad_norm = np.linalg.norm(grad)

        # Log-likelihood
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)
        history['grad_norm'].append(grad_norm)

        # Check convergence
        if grad_norm < tol:
            print(f"Newton-Raphson converged at iteration {iteration}")
            break

        # Weight matrix W = diag(π_i(1-π_i))
        w = pi * (1 - pi)
        w = np.maximum(w, 1e-10)  # Numerical stability

        # Hessian: -X^T W X (negative definite)
        # We need (X^T W X)^{-1} = (-H)^{-1}
        XtWX = X.T @ (w[:, np.newaxis] * X)  # Efficient weighted multiply

        # Newton step: (X^T W X)^{-1} X^T (y - π)
        try:
            step = np.linalg.solve(XtWX, grad)
        except np.linalg.LinAlgError:
            print("Hessian singular; adding regularization")
            step = np.linalg.solve(XtWX + 1e-6 * np.eye(p), grad)

        history['step_norm'].append(np.linalg.norm(step))

        # Update
        theta = theta + step

    return theta, history

# Compare with gradient descent
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_nr, history_nr = newton_raphson(X, y)
print(f"\nNewton-Raphson estimate: {theta_nr}")
print(f"True parameters: {theta_true}")
print(f"Iterations: {len(history_nr['log_likelihood'])}")
print(f"Final gradient norm: {history_nr['grad_norm'][-1]:.2e}")
```

| Property | Gradient Descent | Newton-Raphson |
|---|---|---|
| Convergence Rate | Linear: $\|\theta^{(t)} - \theta^*\| \leq c^t$ | Quadratic: $\|\theta^{(t)} - \theta^*\| \leq c \cdot \|\theta^{(t-1)} - \theta^*\|^2$ |
| Iterations to Converge | Many (100s-1000s) | Few (5-15 typical) |
| Cost per Iteration | $O(np)$ | $O(np^2 + p^3)$ for matrix solve |
| Memory | $O(p)$ | $O(p^2)$ for Hessian |
| Hyperparameters | Learning rate $\eta$ | None (self-scaling) |
| Robustness | Sensitive to $\eta$ | Can fail if Hessian singular |
Quadratic convergence means the number of correct digits roughly doubles each iteration. If iteration 5 gives 3 correct digits, iteration 6 gives ~6, iteration 7 gives ~12. Newton typically converges in 5-10 iterations regardless of problem size, while gradient descent may need thousands.
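A quick way to see this is to iterate the two error recursions directly; the constant $c = 0.5$ below is an arbitrary choice for illustration, not a property of any particular dataset:

```python
# Error decay under linear vs. quadratic convergence (illustrative constants)
e_lin, e_quad = 0.1, 0.1
for t in range(6):
    print(f"iter {t}: linear error = {e_lin:.1e}, quadratic error = {e_quad:.1e}")
    e_lin *= 0.5                # linear: error shrinks by a constant factor each step
    e_quad = 0.5 * e_quad ** 2  # quadratic: error is (roughly) squared each step
```

The quadratic sequence collapses to machine precision within a handful of steps, while the linear sequence gains digits at a fixed rate.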
IRLS is Newton-Raphson in disguise, reformulated as a sequence of weighted least squares problems. This perspective provides intuition and connects to familiar linear regression methods.
The IRLS Formulation:
Rearranging the Newton-Raphson update:
$$\boldsymbol{\theta}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\tilde{\mathbf{y}}^{(t)}$$
where the working response is:
$$\tilde{y}_i^{(t)} = z_i^{(t)} + \frac{y_i - \pi_i^{(t)}}{\pi_i^{(t)}(1-\pi_i^{(t)})}$$
and the working weight is:
$$w_i^{(t)} = \pi_i^{(t)}(1-\pi_i^{(t)})$$
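To see that this is just Newton-Raphson rewritten, substitute $\tilde{\mathbf{y}}^{(t)} = \mathbf{X}\boldsymbol{\theta}^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol{\pi}^{(t)})$ into the IRLS update:

$$(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\tilde{\mathbf{y}}^{(t)} = \boldsymbol{\theta}^{(t)} + (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$

which is exactly the Newton-Raphson update derived above.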
Interpretation:
Each IRLS iteration solves a weighted least squares problem:
$$\boldsymbol{\theta}^{(t+1)} = \arg\min_\theta \sum_{i=1}^{n} w_i^{(t)} (\tilde{y}_i^{(t)} - \boldsymbol{\theta}^T\mathbf{x}_i)^2$$
The weights $w_i$ and responses $\tilde{y}_i$ depend on the previous iteration's predictions.
Why "Reweighted"?
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def irls(X, y, max_iter=25, tol=1e-8):
    """
    Iteratively Reweighted Least Squares for logistic regression.

    Each iteration solves:
        θ = (X^T W X)^{-1} X^T W ỹ
    where:
        - W = diag(π_i(1-π_i))                    [working weights]
        - ỹ_i = z_i + (y_i - π_i) / (π_i(1-π_i))  [working response]

    This is equivalent to Newton-Raphson.
    """
    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': [], 'theta': [theta.copy()]}

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Clip for numerical stability
        pi = np.clip(pi, 1e-10, 1 - 1e-10)

        # Working weights: w_i = π_i(1-π_i)
        w = pi * (1 - pi)

        # Working response: ỹ_i = z_i + (y_i - π_i) / w_i
        working_response = z + (y - pi) / w

        # Weighted least squares: θ = (X^T W X)^{-1} X^T W ỹ
        W_sqrt = np.sqrt(w)
        X_weighted = X * W_sqrt[:, np.newaxis]
        y_weighted = working_response * W_sqrt

        # Solve via QR decomposition for stability
        Q, R = np.linalg.qr(X_weighted)
        theta_new = np.linalg.solve(R, Q.T @ y_weighted)

        # Check convergence
        if np.linalg.norm(theta_new - theta) < tol:
            print(f"IRLS converged at iteration {iteration}")
            theta = theta_new
            break

        theta = theta_new

        # Log-likelihood
        z = X @ theta
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)
        history['theta'].append(theta.copy())

    return theta, history

# Demonstrate IRLS
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_irls, history_irls = irls(X, y)
print(f"\nIRLS estimate: {theta_irls}")
print(f"True params: {theta_true}")
print(f"Iterations: {len(history_irls['log_likelihood'])}")
```

The IRLS algorithm generalizes naturally to all Generalized Linear Models with canonical links. The form is the same: working weights from the variance function, working response from the link. This is implemented in R's glm() function and Python's statsmodels. Understanding IRLS for logistic regression teaches you the pattern for Poisson, negative binomial, and other GLMs.
For massive datasets where computing the full gradient $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ is expensive, Stochastic Gradient Descent (SGD) offers a scalable alternative.
The Idea:
Instead of computing the exact gradient using all $n$ observations, use a noisy estimate from a small subset (minibatch):
$$\nabla_\theta \ell \approx \frac{n}{|B|} \sum_{i \in B} (y_i - \pi_i) \mathbf{x}_i$$
where $B$ is a randomly selected minibatch of size $|B| \ll n$.
The SGD Update:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta_t \cdot \frac{n}{|B|} \sum_{i \in B} (y_i - \pi_i^{(t)}) \mathbf{x}_i$$
Key Properties:
- Each update touches only $|B|$ observations, so the per-step cost is $O(|B|\,p)$ rather than $O(np)$.
- The minibatch gradient is an unbiased but noisy estimate of the full gradient, so the iterates fluctuate around the optimum rather than settling exactly on it.
- Convergence requires a decaying learning rate $\eta_t$ (e.g., $\eta_t \propto 1/t$ or $1/\sqrt{t}$); with a constant rate, SGD only reaches a noise-dominated neighborhood of the MLE.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def sgd_logistic(X, y, batch_size=32, epochs=10, eta0=0.1,
                 decay='inverse', random_state=None):
    """
    Stochastic Gradient Descent for logistic regression.

    Parameters
    ----------
    batch_size : int
        Minibatch size
    epochs : int
        Number of passes through the data
    eta0 : float
        Initial learning rate
    decay : str
        Learning rate schedule: 'constant', 'inverse', 'sqrt'
    """
    if random_state is not None:
        np.random.seed(random_state)

    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': []}
    t = 0  # Iteration counter

    for epoch in range(epochs):
        # Shuffle data at each epoch
        indices = np.random.permutation(n)

        for start in range(0, n, batch_size):
            # Get minibatch
            batch_idx = indices[start:start + batch_size]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]

            # Compute minibatch gradient
            z_batch = X_batch @ theta
            pi_batch = sigmoid(z_batch)
            grad = X_batch.T @ (y_batch - pi_batch)

            # Scale gradient (optional: makes learning rate more interpretable)
            # grad = grad * (n / len(batch_idx))

            # Learning rate schedule
            t += 1
            if decay == 'constant':
                eta = eta0
            elif decay == 'inverse':
                eta = eta0 / t
            elif decay == 'sqrt':
                eta = eta0 / np.sqrt(t)
            else:
                eta = eta0

            # Update
            theta = theta + eta * grad

        # Compute full log-likelihood at end of epoch
        z = X @ theta
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)

    return theta, history

# Demo
np.random.seed(42)
n, p = 1000, 5
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p)
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_sgd, history_sgd = sgd_logistic(X, y, batch_size=64, epochs=50,
                                      eta0=0.5, decay='sqrt', random_state=42)
print(f"SGD estimate: {theta_sgd}")
print(f"True params: {theta_true}")
```

Use SGD when $n$ is very large (millions+) and computing the full gradient is impractical. For moderate $n$ (thousands to hundreds of thousands), batch methods (Newton, L-BFGS) are often faster to converge. SGD shines when you'd rather make progress on a cheap approximation than wait for an expensive exact computation.
Quasi-Newton methods approximate the Hessian (or its inverse) using gradient information, achieving near-Newton convergence without the $O(p^3)$ matrix inversion cost.
The Challenge with Newton:
Newton's method requires computing and inverting $\mathbf{H} = -\mathbf{X}^T\mathbf{W}\mathbf{X}$, a $p \times p$ matrix. For large $p$ (thousands of features), this is prohibitive in both time ($O(p^3)$) and memory ($O(p^2)$).
BFGS (Broyden-Fletcher-Goldfarb-Shanno):
BFGS maintains an approximation $\mathbf{B}$ to the inverse Hessian, updated each iteration using gradient differences:
$$\mathbf{s}^{(t)} = \boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}$$ $$\mathbf{g}^{(t)} = \nabla_\theta \ell(\boldsymbol{\theta}^{(t+1)}) - \nabla_\theta \ell(\boldsymbol{\theta}^{(t)})$$
The update ensures the approximation satisfies the secant condition: $\mathbf{B}^{(t+1)}\mathbf{g}^{(t)} = \mathbf{s}^{(t)}$.
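For reference, the classic rank-two update — written here in the usual minimization convention, i.e., applied to $-\ell$ with curvature pair $\mathbf{s} = \mathbf{s}^{(t)}$ and $\mathbf{y} = -\mathbf{g}^{(t)}$ so that $\mathbf{y}^T\mathbf{s} > 0$ — is:

$$\mathbf{B}^{(t+1)} = \left(\mathbf{I} - \rho\,\mathbf{s}\mathbf{y}^T\right)\mathbf{B}^{(t)}\left(\mathbf{I} - \rho\,\mathbf{y}\mathbf{s}^T\right) + \rho\,\mathbf{s}\mathbf{s}^T, \qquad \rho = \frac{1}{\mathbf{y}^T\mathbf{s}}$$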
L-BFGS (Limited-memory BFGS):
Instead of storing the full $p \times p$ matrix, L-BFGS stores only the last $m$ pairs $(\mathbf{s}^{(t)}, \mathbf{g}^{(t)})$ and computes the approximate Hessian-vector product implicitly.
Memory: $O(mp)$ instead of $O(p^2)$, where $m$ is typically 5-20.
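To make "implicitly" concrete, here is a minimal sketch of the standard two-loop recursion. The function name, argument layout, and initial-scaling heuristic are choices for this example (not scipy's implementation), and it uses the minimization convention (gradient differences of $-\ell$):

```python
import numpy as np

def lbfgs_direction(grad, s_list, g_list):
    """Approximate (inverse Hessian) @ grad from the last m stored pairs.
    s_list: parameter steps, g_list: gradient differences of the minimized objective."""
    q = grad.astype(float).copy()
    rhos = [1.0 / (g @ s) for s, g in zip(s_list, g_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, g, rho in zip(reversed(s_list), reversed(g_list), reversed(rhos)):
        alpha = rho * (s @ q)
        q = q - alpha * g
        alphas.append(alpha)
    # Initial inverse-Hessian guess gamma * I (a common scaling heuristic)
    gamma = (s_list[-1] @ g_list[-1]) / (g_list[-1] @ g_list[-1]) if s_list else 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, g, rho), alpha in zip(zip(s_list, g_list, rhos), reversed(alphas)):
        beta = rho * (g @ r)
        r = r + (alpha - beta) * s
    return r  # the search direction for minimization is -r
```

Only dot products and vector updates are needed, which is why the memory footprint stays at $O(mp)$.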
| Method | Hessian | Memory | Convergence |
|---|---|---|---|
| Newton | Exact, $O(p^3)$ compute | $O(p^2)$ | Quadratic |
| BFGS | Approximate, $O(p^2)$ update | $O(p^2)$ | Superlinear |
| L-BFGS | Implicit, $O(mp)$ compute | $O(mp)$ | Superlinear |
| Gradient Descent | None (identity) | $O(p)$ | Linear |
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def neg_log_likelihood(theta, X, y):
    """Negative log-likelihood (for minimization)."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return -np.sum(y * z - log_partition)

def neg_gradient(theta, X, y):
    """Negative gradient (for minimization)."""
    z = X @ theta
    pi = sigmoid(z)
    return -X.T @ (y - pi)

# Example with L-BFGS-B
np.random.seed(42)
n, p = 1000, 50  # Large p where Newton would be expensive
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p) * 0.5
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# L-BFGS optimization
result = minimize(
    fun=neg_log_likelihood,
    x0=np.zeros(p),
    args=(X, y),
    method='L-BFGS-B',
    jac=neg_gradient,
    options={'maxiter': 100, 'disp': True}
)

theta_lbfgs = result.x
print(f"\nL-BFGS converged: {result.success}")
print(f"Iterations: {result.nit}")
print(f"Function evaluations: {result.nfev}")
print(f"Gradient evaluations: {result.njev}")
print(f"Final loss: {result.fun:.4f}")
```

For most practical logistic regression problems with moderate $n$ and $p$, L-BFGS is the best default. It's implemented in scipy, sklearn's LogisticRegression (solver='lbfgs'), and most ML libraries. It combines Newton-like convergence with gradient-descent-like memory usage.
In practice, we often add regularization to prevent overfitting and ensure the MLE exists (even with separable data).
L2 Regularization (Ridge):
The regularized objective is:
$$\mathcal{J}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$$
where $\lambda > 0$ is the regularization strength.
Modified Gradient:
$$\nabla_\theta \mathcal{J} = -\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}) + \lambda\boldsymbol{\theta}$$
Modified Hessian:
$$\nabla^2_\theta \mathcal{J} = \mathbf{X}^T\mathbf{W}\mathbf{X} + \lambda\mathbf{I}$$
The $\lambda\mathbf{I}$ term ensures the Hessian is always positive definite, avoiding singularity issues.
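Plugging the modified gradient and Hessian into the Newton step gives the regularized update:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\left(\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)}) - \lambda\boldsymbol{\theta}^{(t)}\right)$$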
L1 Regularization (Lasso):
For L1 regularization, the objective is non-differentiable at $\theta_j = 0$:
$$\mathcal{J}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_1$$
This requires specialized algorithms, such as coordinate descent or proximal gradient methods; a minimal sketch of the latter follows.
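Here is a minimal proximal-gradient (ISTA) sketch for the L1-penalized objective. The function names, fixed step size, and iteration count are illustrative choices under stated assumptions, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_logistic_l1(X, y, lam=0.1, step=0.01, max_iter=2000):
    """Proximal gradient (ISTA) for L1-penalized logistic regression.
    The fixed step size is an assumption; in practice use a line search
    or step < 1/L with L = ||X||_2^2 / 4."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(max_iter):
        pi = sigmoid(X @ theta)
        grad_nll = -X.T @ (y - pi)  # gradient of the smooth part (negative log-likelihood)
        # Gradient step on the smooth part, then proximal step for the L1 penalty
        theta = soft_threshold(theta - step * grad_nll, step * lam)
        # (In practice the intercept column is usually excluded from the penalty.)
    return theta
```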
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def objective_l2(theta, X, y, lambda_reg):
    """Negative log-likelihood + L2 penalty."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    nll = -np.sum(y * z - log_partition)
    l2_penalty = 0.5 * lambda_reg * np.sum(theta**2)
    return nll + l2_penalty

def gradient_l2(theta, X, y, lambda_reg):
    """Gradient of regularized objective."""
    z = X @ theta
    pi = sigmoid(z)
    grad_nll = -X.T @ (y - pi)
    grad_l2 = lambda_reg * theta
    return grad_nll + grad_l2

def fit_logistic_l2(X, y, lambda_reg=1.0):
    """Fit L2-regularized logistic regression."""
    n, p = X.shape
    result = minimize(
        fun=objective_l2,
        x0=np.zeros(p),
        args=(X, y, lambda_reg),
        method='L-BFGS-B',
        jac=gradient_l2
    )
    return result.x

# Demo with regularization
np.random.seed(42)
n, p = 100, 50  # Many features relative to observations
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.zeros(p)
theta_true[:5] = np.array([0.5, 1.0, -0.5, 0.8, -0.3])  # Sparse true model
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Compare different regularization strengths
for lam in [0.01, 0.1, 1.0, 10.0]:
    theta_est = fit_logistic_l2(X, y, lambda_reg=lam)
    nonzero = np.sum(np.abs(theta_est) > 0.01)
    print(f"λ={lam:5.2f}: ||θ||₂ = {np.linalg.norm(theta_est):.3f}, "
          f"active features = {nonzero}")
```

Unless you have strong theoretical reasons not to, always use regularization. It prevents numerical issues with separable data, improves generalization, and rarely hurts if $\lambda$ is chosen via cross-validation. sklearn's LogisticRegression uses L2 regularization by default.
How do we know when optimization has converged? Several criteria are used in practice:
1. Gradient Norm:
$$\|\nabla_\theta \ell\| < \epsilon_{\text{grad}}$$
The gradient should be near zero at the optimum. Typical threshold: $10^{-6}$ to $10^{-8}$.
2. Parameter Change:
$$\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \epsilon_{\theta}$$
Parameters should stabilize. Useful but can be misleading in flat regions.
3. Objective Change:
$$|\ell^{(t+1)} - \ell^{(t)}| < \epsilon_{\ell}$$
Or relative: $|\ell^{(t+1)} - \ell^{(t)}| / |\ell^{(t)}| < \epsilon$
4. Maximum Iterations:
Always include a maximum iteration count to prevent infinite loops.
Warning Signs:
- The log-likelihood decreases or oscillates between iterations (learning rate too large, or a bug in the gradient).
- Coefficients grow without bound while the gradient norm stays flat — the classic symptom of separable data, where the unregularized MLE does not exist.
- The gradient norm stalls well above the tolerance (ill-conditioned design, or a tolerance set tighter than the arithmetic precision allows).
Best Practices:
- Combine criteria: require a small gradient norm and always cap the iteration count (see the helper sketched after this list).
- Track the log-likelihood at every iteration; for batch methods it should increase monotonically.
- Standardize features and add regularization whenever separation is plausible.
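A minimal helper combining the criteria above might look like this; the function name and thresholds are illustrative defaults, not universal recommendations:

```python
import numpy as np

def converged(grad, theta_new, theta_old, ll_new, ll_old,
              eps_grad=1e-6, eps_theta=1e-8, eps_ll=1e-10):
    """Return True if any of the three convergence criteria is met.
    Call once per iteration inside the optimization loop, alongside a max_iter cap."""
    grad_small = np.linalg.norm(grad) < eps_grad
    theta_stable = np.linalg.norm(theta_new - theta_old) < eps_theta
    ll_stable = abs(ll_new - ll_old) < eps_ll * max(1.0, abs(ll_old))
    return grad_small or theta_stable or ll_stable
```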
A converged optimizer may still be wrong if the code has bugs, the data is corrupted, or the model is misspecified. Always sanity-check your results: Do coefficients have sensible signs and magnitudes? Does accuracy on held-out data match expectations? Do predictions cover the full range [0, 1]?
We've explored the major optimization approaches for logistic regression MLE. Each has its place in the practitioner's toolkit.
| Algorithm | Best For | Key Tradeoff |
|---|---|---|
| Gradient Descent | Learning, simple problems | Simple but slow |
| Newton-Raphson/IRLS | Small-medium $p$, high precision | Fast convergence, expensive iteration |
| L-BFGS | Medium-large $p$, general use | Good balance of speed and memory |
| SGD | Very large $n$, online learning | Scalable but noisy convergence |
| Coordinate Descent | L1 regularization, sparse models | Per-coordinate updates, feature selection |
Module Complete:
You have now mastered Maximum Likelihood Estimation for logistic regression, from theory to practice.
This knowledge extends far beyond logistic regression. The same principles—likelihood, gradient, iterative optimization—underpin neural networks, probabilistic graphical models, and much of modern machine learning.
Congratulations! You now have a complete, rigorous understanding of Maximum Likelihood Estimation for logistic regression. You can derive the likelihood and gradient from scratch, understand why iterative optimization is necessary, and implement multiple algorithms. This foundation prepares you for multi-class logistic regression, regularized methods, and the broader world of probabilistic machine learning.