We have built the theoretical foundation for Maximum Likelihood Estimation in logistic regression.
Now we must put this theory into practice. How do we actually compute the MLE? What algorithms converge efficiently? What are the computational tradeoffs?
This page develops the major optimization approaches for logistic regression, from basic gradient descent to sophisticated second-order methods. Understanding these algorithms is essential not just for logistic regression, but for the broader field of machine learning where similar optimization challenges arise constantly.
By the end of this page, you will understand gradient descent and its variants, master Newton-Raphson and its connection to IRLS, appreciate the computational tradeoffs between methods, and be able to implement these algorithms from scratch with confidence.
The simplest optimization algorithm is gradient descent (or gradient ascent, when maximizing). The idea is elementary: move in the direction of the gradient.
The Algorithm:
Since we want to maximize the log-likelihood, we use gradient ascent:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \nabla_\theta \ell(\boldsymbol{\theta}^{(t)})$$
where $\eta > 0$ is the learning rate (step size).
Substituting the gradient we derived:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$
Why It Works:
The gradient $\nabla_\theta \ell$ points in the direction of steepest increase of $\ell$. By moving in this direction (with small enough $\eta$), we guarantee the log-likelihood increases:
$$\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$$
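To see why, take a first-order Taylor expansion at $\boldsymbol{\theta}^{(t)}$ (a one-line sketch of the standard argument):

$$\ell\big(\boldsymbol{\theta}^{(t)} + \eta\,\nabla_\theta \ell\big) \approx \ell(\boldsymbol{\theta}^{(t)}) + \eta\,\|\nabla_\theta \ell\|^2 \geq \ell(\boldsymbol{\theta}^{(t)})$$

so for sufficiently small $\eta$ the step cannot decrease the objective, with equality only when the gradient is zero.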
Since $\ell$ is concave, any stationary point ($\nabla_\theta \ell = 0$) is a global maximum. Hence gradient ascent converges to the MLE whenever the MLE exists.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def log_likelihood(theta, X, y):
    """Compute log-likelihood."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

def gradient(theta, X, y):
    """Compute gradient of log-likelihood."""
    z = X @ theta
    pi = sigmoid(z)
    return X.T @ (y - pi)

def gradient_ascent(X, y, eta=0.1, max_iter=1000, tol=1e-6):
    """
    Gradient ascent for logistic regression.

    Parameters
    ----------
    X : ndarray (n, p+1)
        Design matrix with intercept column
    y : ndarray (n,)
        Binary labels
    eta : float
        Learning rate
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance on gradient norm

    Returns
    -------
    theta : ndarray (p+1,)
        MLE parameters
    history : dict
        Training history (log-likelihood, gradient norms)
    """
    n, p = X.shape
    theta = np.zeros(p)  # Initialize at origin
    history = {'log_likelihood': [], 'grad_norm': []}

    for iteration in range(max_iter):
        # Compute gradient
        grad = gradient(theta, X, y)
        grad_norm = np.linalg.norm(grad)

        # Store history
        ll = log_likelihood(theta, X, y)
        history['log_likelihood'].append(ll)
        history['grad_norm'].append(grad_norm)

        # Check convergence
        if grad_norm < tol:
            print(f"Converged at iteration {iteration}")
            break

        # Update parameters
        theta = theta + eta * grad

    return theta, history

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_gd, history = gradient_ascent(X, y, eta=0.1, max_iter=500)
print(f"\nGradient Descent estimate: {theta_gd}")
print(f"True parameters: {theta_true}")
print(f"Final log-likelihood: {history['log_likelihood'][-1]:.4f}")
```

Choosing $\eta$ is tricky. Theory says $\eta < 2/L$ where $L$ is the Lipschitz constant of the gradient, but computing $L$ requires knowing properties of the Hessian. Line search (trying to find optimal $\eta$ per iteration) or adaptive methods (Adam, AdaGrad) are practical alternatives.
Newton-Raphson (or simply Newton's method) uses second-order information (the Hessian) to achieve faster convergence.
The Idea:
Instead of moving a fixed step in the gradient direction, Newton's method approximates the objective locally with a quadratic function and jumps directly to the maximum of that approximation.
Second-Order Taylor Expansion:
Near current point $\boldsymbol{\theta}^{(t)}$:
$$\ell(\boldsymbol{\theta}) \approx \ell(\boldsymbol{\theta}^{(t)}) + \nabla_\theta \ell^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^T \mathbf{H} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})$$
where $\mathbf{H} = \nabla^2_\theta \ell = -\mathbf{X}^T\mathbf{W}\mathbf{X}$ is the Hessian.
Maximizing the Quadratic Approximation:
Setting the derivative of the approximation to zero:
$$\nabla_\theta \ell + \mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) = 0$$
Solving for $\boldsymbol{\theta}$:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \mathbf{H}^{-1} \nabla_\theta \ell$$
Since we're maximizing a concave function, $\mathbf{H}$ is negative definite, so $-\mathbf{H}^{-1}$ is positive definite and the update moves toward the maximum.
For Logistic Regression:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$
where $\mathbf{W}^{(t)} = \text{diag}(\pi_1^{(t)}(1-\pi_1^{(t)}), \ldots, \pi_n^{(t)}(1-\pi_n^{(t)}))$.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def newton_raphson(X, y, max_iter=25, tol=1e-8):
    """
    Newton-Raphson for logistic regression.

    Update: θ^(t+1) = θ^(t) + (X^T W X)^{-1} X^T (y - π)
    where W = diag(π_i(1-π_i))

    Returns
    -------
    theta : MLE parameters
    history : dict with convergence info
    """
    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': [], 'grad_norm': [], 'step_norm': []}

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Gradient
        residual = y - pi
        grad = X.T @ residual
        grad_norm = np.linalg.norm(grad)

        # Log-likelihood
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)
        history['grad_norm'].append(grad_norm)

        # Check convergence
        if grad_norm < tol:
            print(f"Newton-Raphson converged at iteration {iteration}")
            break

        # Weight matrix W = diag(π_i(1-π_i))
        w = pi * (1 - pi)
        w = np.maximum(w, 1e-10)  # Numerical stability

        # Hessian: -X^T W X (negative definite)
        # We need (X^T W X)^{-1} = (-H)^{-1}
        XtWX = X.T @ (w[:, np.newaxis] * X)  # Efficient weighted multiply

        # Newton step: (X^T W X)^{-1} X^T (y - π)
        try:
            step = np.linalg.solve(XtWX, grad)
        except np.linalg.LinAlgError:
            print("Hessian singular; adding regularization")
            step = np.linalg.solve(XtWX + 1e-6 * np.eye(p), grad)

        history['step_norm'].append(np.linalg.norm(step))

        # Update
        theta = theta + step

    return theta, history

# Compare with gradient descent
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_nr, history_nr = newton_raphson(X, y)
print(f"\nNewton-Raphson estimate: {theta_nr}")
print(f"True parameters: {theta_true}")
print(f"Iterations: {len(history_nr['log_likelihood'])}")
print(f"Final gradient norm: {history_nr['grad_norm'][-1]:.2e}")
```

| Property | Gradient Descent | Newton-Raphson |
|---|---|---|
| Convergence Rate | Linear: $\|\theta^{(t)} - \theta^*\| \leq c^t$ | Quadratic: $\|\theta^{(t)} - \theta^*\| \leq c \cdot \|\theta^{(t-1)} - \theta^*\|^2$ |
| Iterations to Converge | Many (100s-1000s) | Few (5-15 typical) |
| Cost per Iteration | $O(np)$ | $O(np^2 + p^3)$ for matrix solve |
| Memory | $O(p)$ | $O(p^2)$ for Hessian |
| Hyperparameters | Learning rate $\eta$ | None (self-scaling) |
| Robustness | Sensitive to $\eta$ | Can fail if Hessian singular |
Quadratic convergence means the number of correct digits roughly doubles each iteration. If iteration 5 gives 3 correct digits, iteration 6 gives ~6, iteration 7 gives ~12. Newton typically converges in 5-10 iterations regardless of problem size, while gradient descent may need thousands.
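A quick way to see this is to iterate the two error recursions directly; the constant $c = 0.5$ below is an arbitrary choice for illustration, not a property of any particular dataset:

```python
# Error decay under linear vs. quadratic convergence (illustrative constants)
e_lin, e_quad = 0.1, 0.1
for t in range(6):
    print(f"iter {t}: linear error = {e_lin:.1e}, quadratic error = {e_quad:.1e}")
    e_lin *= 0.5                # linear: error shrinks by a constant factor each step
    e_quad = 0.5 * e_quad ** 2  # quadratic: error is (roughly) squared each step
```

The quadratic sequence collapses to machine precision within a handful of steps, while the linear sequence gains digits at a fixed rate.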
IRLS is Newton-Raphson in disguise, reformulated as a sequence of weighted least squares problems. This perspective provides intuition and connects to familiar linear regression methods.
The IRLS Formulation:
Rearranging the Newton-Raphson update:
$$\boldsymbol{\theta}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\tilde{\mathbf{y}}^{(t)}$$
where the working response is:
$$\tilde{y}_i^{(t)} = z_i^{(t)} + \frac{y_i - \pi_i^{(t)}}{\pi_i^{(t)}(1-\pi_i^{(t)})}$$
and the working weight is:
$$w_i^{(t)} = \pi_i^{(t)}(1-\pi_i^{(t)})$$
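To see that this is just Newton-Raphson rewritten, substitute $\tilde{\mathbf{y}}^{(t)} = \mathbf{X}\boldsymbol{\theta}^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol{\pi}^{(t)})$ into the IRLS update:

$$(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\tilde{\mathbf{y}}^{(t)} = \boldsymbol{\theta}^{(t)} + (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)})$$

which is exactly the Newton-Raphson update derived above.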
Interpretation:
Each IRLS iteration solves a weighted least squares problem:
$$\boldsymbol{\theta}^{(t+1)} = \arg\min_\theta \sum_{i=1}^{n} w_i^{(t)} (\tilde{y}_i^{(t)} - \boldsymbol{\theta}^T\mathbf{x}_i)^2$$
The weights $w_i$ and responses $\tilde{y}_i$ depend on the previous iteration's predictions.
Why "Reweighted"?
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def irls(X, y, max_iter=25, tol=1e-8):
    """
    Iteratively Reweighted Least Squares for logistic regression.

    Each iteration solves:
        θ = (X^T W X)^{-1} X^T W ỹ
    where:
        - W = diag(π_i(1-π_i))                    [working weights]
        - ỹ_i = z_i + (y_i - π_i) / (π_i(1-π_i))  [working response]

    This is equivalent to Newton-Raphson.
    """
    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': [], 'theta': [theta.copy()]}

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Clip for numerical stability
        pi = np.clip(pi, 1e-10, 1 - 1e-10)

        # Working weights: w_i = π_i(1-π_i)
        w = pi * (1 - pi)

        # Working response: ỹ_i = z_i + (y_i - π_i) / w_i
        working_response = z + (y - pi) / w

        # Weighted least squares: θ = (X^T W X)^{-1} X^T W ỹ
        W_sqrt = np.sqrt(w)
        X_weighted = X * W_sqrt[:, np.newaxis]
        y_weighted = working_response * W_sqrt

        # Solve via QR decomposition for stability
        Q, R = np.linalg.qr(X_weighted)
        theta_new = np.linalg.solve(R, Q.T @ y_weighted)

        # Check convergence
        if np.linalg.norm(theta_new - theta) < tol:
            print(f"IRLS converged at iteration {iteration}")
            theta = theta_new
            break

        theta = theta_new

        # Log-likelihood
        z = X @ theta
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)
        history['theta'].append(theta.copy())

    return theta, history

# Demonstrate IRLS
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_irls, history_irls = irls(X, y)
print(f"\nIRLS estimate: {theta_irls}")
print(f"True params: {theta_true}")
print(f"Iterations: {len(history_irls['log_likelihood'])}")
```

The IRLS algorithm generalizes naturally to all Generalized Linear Models with canonical links. The form is the same: working weights from the variance function, working response from the link. This is implemented in R's glm() function and Python's statsmodels. Understanding IRLS for logistic regression teaches you the pattern for Poisson, negative binomial, and other GLMs.
For massive datasets where computing the full gradient $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ is expensive, Stochastic Gradient Descent (SGD) offers a scalable alternative.
The Idea:
Instead of computing the exact gradient using all $n$ observations, use a noisy estimate from a small subset (minibatch):
$$\nabla_\theta \ell \approx \frac{n}{|B|} \sum_{i \in B} (y_i - \pi_i) \mathbf{x}_i$$
where $B$ is a randomly selected minibatch of size $|B| \ll n$.
The SGD Update:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta_t \cdot \frac{n}{|B|} \sum_{i \in B} (y_i - \pi_i^{(t)}) \mathbf{x}_i$$
Key Properties:
- Each update touches only $|B|$ observations, so the per-step cost is $O(|B|\,p)$ rather than $O(np)$.
- The minibatch gradient is an unbiased but noisy estimate of the full gradient, so the iterates fluctuate around the optimum rather than settling exactly on it.
- Convergence requires a decaying learning rate $\eta_t$ (e.g., $\eta_t \propto 1/t$ or $1/\sqrt{t}$); with a constant rate, SGD only reaches a noise-dominated neighborhood of the MLE.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def sgd_logistic(X, y, batch_size=32, epochs=10, eta0=0.1,
                 decay='inverse', random_state=None):
    """
    Stochastic Gradient Descent for logistic regression.

    Parameters
    ----------
    batch_size : int
        Minibatch size
    epochs : int
        Number of passes through the data
    eta0 : float
        Initial learning rate
    decay : str
        Learning rate schedule: 'constant', 'inverse', 'sqrt'
    """
    if random_state is not None:
        np.random.seed(random_state)

    n, p = X.shape
    theta = np.zeros(p)
    history = {'log_likelihood': []}
    t = 0  # Iteration counter

    for epoch in range(epochs):
        # Shuffle data at each epoch
        indices = np.random.permutation(n)

        for start in range(0, n, batch_size):
            # Get minibatch
            batch_idx = indices[start:start + batch_size]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]

            # Compute minibatch gradient
            z_batch = X_batch @ theta
            pi_batch = sigmoid(z_batch)
            grad = X_batch.T @ (y_batch - pi_batch)

            # Scale gradient (optional: makes learning rate more interpretable)
            # grad = grad * (n / len(batch_idx))

            # Learning rate schedule
            t += 1
            if decay == 'constant':
                eta = eta0
            elif decay == 'inverse':
                eta = eta0 / t
            elif decay == 'sqrt':
                eta = eta0 / np.sqrt(t)
            else:
                eta = eta0

            # Update
            theta = theta + eta * grad

        # Compute full log-likelihood at end of epoch
        z = X @ theta
        log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
        ll = np.sum(y * z - log_partition)
        history['log_likelihood'].append(ll)

    return theta, history

# Demo
np.random.seed(42)
n, p = 1000, 5
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p)
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_sgd, history_sgd = sgd_logistic(X, y, batch_size=64, epochs=50,
                                      eta0=0.5, decay='sqrt', random_state=42)
print(f"SGD estimate: {theta_sgd}")
print(f"True params: {theta_true}")
```

Use SGD when $n$ is very large (millions+) and computing the full gradient is impractical. For moderate $n$ (thousands to hundreds of thousands), batch methods (Newton, L-BFGS) are often faster to converge. SGD shines when you'd rather make progress on a cheap approximation than wait for an expensive exact computation.
Quasi-Newton methods approximate the Hessian (or its inverse) using gradient information, achieving near-Newton convergence without the $O(p^3)$ matrix inversion cost.
The Challenge with Newton:
Newton's method requires computing and inverting $\mathbf{H} = -\mathbf{X}^T\mathbf{W}\mathbf{X}$, a $p \times p$ matrix. For large $p$ (thousands of features), this is prohibitive in both time ($O(p^3)$) and memory ($O(p^2)$).
BFGS (Broyden-Fletcher-Goldfarb-Shanno):
BFGS maintains an approximation $\mathbf{B}$ to the inverse Hessian, updated each iteration using gradient differences:
$$\mathbf{s}^{(t)} = \boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}$$ $$\mathbf{g}^{(t)} = \nabla_\theta \ell(\boldsymbol{\theta}^{(t+1)}) - \nabla_\theta \ell(\boldsymbol{\theta}^{(t)})$$
The update ensures the approximation satisfies the secant condition: $\mathbf{B}^{(t+1)}\mathbf{g}^{(t)} = \mathbf{s}^{(t)}$.
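For reference, the classic rank-two update — written here in the usual minimization convention, i.e., applied to $-\ell$ with curvature pair $\mathbf{s} = \mathbf{s}^{(t)}$ and $\mathbf{y} = -\mathbf{g}^{(t)}$ so that $\mathbf{y}^T\mathbf{s} > 0$ — is:

$$\mathbf{B}^{(t+1)} = \left(\mathbf{I} - \rho\,\mathbf{s}\mathbf{y}^T\right)\mathbf{B}^{(t)}\left(\mathbf{I} - \rho\,\mathbf{y}\mathbf{s}^T\right) + \rho\,\mathbf{s}\mathbf{s}^T, \qquad \rho = \frac{1}{\mathbf{y}^T\mathbf{s}}$$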
L-BFGS (Limited-memory BFGS):
Instead of storing the full $p \times p$ matrix, L-BFGS stores only the last $m$ pairs $(\mathbf{s}^{(t)}, \mathbf{g}^{(t)})$ and computes the approximate Hessian-vector product implicitly.
Memory: $O(mp)$ instead of $O(p^2)$, where $m$ is typically 5-20.
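To make "implicitly" concrete, here is a minimal sketch of the standard two-loop recursion. The function name, argument layout, and initial-scaling heuristic are choices for this example (not scipy's implementation), and it uses the minimization convention (gradient differences of $-\ell$):

```python
import numpy as np

def lbfgs_direction(grad, s_list, g_list):
    """Approximate (inverse Hessian) @ grad from the last m stored pairs.
    s_list: parameter steps, g_list: gradient differences of the minimized objective."""
    q = grad.astype(float).copy()
    rhos = [1.0 / (g @ s) for s, g in zip(s_list, g_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, g, rho in zip(reversed(s_list), reversed(g_list), reversed(rhos)):
        alpha = rho * (s @ q)
        q = q - alpha * g
        alphas.append(alpha)
    # Initial inverse-Hessian guess gamma * I (a common scaling heuristic)
    gamma = (s_list[-1] @ g_list[-1]) / (g_list[-1] @ g_list[-1]) if s_list else 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, g, rho), alpha in zip(zip(s_list, g_list, rhos), reversed(alphas)):
        beta = rho * (g @ r)
        r = r + (alpha - beta) * s
    return r  # the search direction for minimization is -r
```

Only dot products and vector updates are needed, which is why the memory footprint stays at $O(mp)$.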
| Method | Hessian | Memory | Convergence |
|---|---|---|---|
| Newton | Exact, $O(p^3)$ compute | $O(p^2)$ | Quadratic |
| BFGS | Approximate, $O(p^2)$ update | $O(p^2)$ | Superlinear |
| L-BFGS | Implicit, $O(mp)$ compute | $O(mp)$ | Superlinear |
| Gradient Descent | None (identity) | $O(p)$ | Linear |
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def neg_log_likelihood(theta, X, y):
    """Negative log-likelihood (for minimization)."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return -np.sum(y * z - log_partition)

def neg_gradient(theta, X, y):
    """Negative gradient (for minimization)."""
    z = X @ theta
    pi = sigmoid(z)
    return -X.T @ (y - pi)

# Example with L-BFGS-B
np.random.seed(42)
n, p = 1000, 50  # Large p where Newton would be expensive
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.random.randn(p) * 0.5
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# L-BFGS optimization
result = minimize(
    fun=neg_log_likelihood,
    x0=np.zeros(p),
    args=(X, y),
    method='L-BFGS-B',
    jac=neg_gradient,
    options={'maxiter': 100, 'disp': True}
)

theta_lbfgs = result.x
print(f"\nL-BFGS converged: {result.success}")
print(f"Iterations: {result.nit}")
print(f"Function evaluations: {result.nfev}")
print(f"Gradient evaluations: {result.njev}")
print(f"Final loss: {result.fun:.4f}")
```

For most practical logistic regression problems with moderate $n$ and $p$, L-BFGS is the best default. It's implemented in scipy, sklearn's LogisticRegression (solver='lbfgs'), and most ML libraries. It combines Newton-like convergence with gradient-descent-like memory usage.
In practice, we often add regularization to prevent overfitting and ensure the MLE exists (even with separable data).
L2 Regularization (Ridge):
The regularized objective is:
$$\mathcal{J}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2$$
where $\lambda > 0$ is the regularization strength.
Modified Gradient:
$$\nabla_\theta \mathcal{J} = -\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}) + \lambda\boldsymbol{\theta}$$
Modified Hessian:
$$\nabla^2_\theta \mathcal{J} = \mathbf{X}^T\mathbf{W}\mathbf{X} + \lambda\mathbf{I}$$
The $\lambda\mathbf{I}$ term ensures the Hessian is always positive definite, avoiding singularity issues.
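Plugging the modified gradient and Hessian into the Newton step gives the regularized update:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\left(\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}^{(t)}) - \lambda\boldsymbol{\theta}^{(t)}\right)$$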
L1 Regularization (Lasso):
For L1 regularization, the objective is non-differentiable at $\theta_j = 0$:
$$\mathcal{J}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_1$$
This requires specialized algorithms, such as coordinate descent or proximal gradient methods; a minimal sketch of the latter follows.
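Here is a minimal proximal-gradient (ISTA) sketch for the L1-penalized objective. The function names, fixed step size, and iteration count are illustrative choices under stated assumptions, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_logistic_l1(X, y, lam=0.1, step=0.01, max_iter=2000):
    """Proximal gradient (ISTA) for L1-penalized logistic regression.
    The fixed step size is an assumption; in practice use a line search
    or step < 1/L with L = ||X||_2^2 / 4."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(max_iter):
        pi = sigmoid(X @ theta)
        grad_nll = -X.T @ (y - pi)  # gradient of the smooth part (negative log-likelihood)
        # Gradient step on the smooth part, then proximal step for the L1 penalty
        theta = soft_threshold(theta - step * grad_nll, step * lam)
        # (In practice the intercept column is usually excluded from the penalty.)
    return theta
```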
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def objective_l2(theta, X, y, lambda_reg):
    """Negative log-likelihood + L2 penalty."""
    z = X @ theta
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    nll = -np.sum(y * z - log_partition)
    l2_penalty = 0.5 * lambda_reg * np.sum(theta**2)
    return nll + l2_penalty

def gradient_l2(theta, X, y, lambda_reg):
    """Gradient of regularized objective."""
    z = X @ theta
    pi = sigmoid(z)
    grad_nll = -X.T @ (y - pi)
    grad_l2 = lambda_reg * theta
    return grad_nll + grad_l2

def fit_logistic_l2(X, y, lambda_reg=1.0):
    """Fit L2-regularized logistic regression."""
    n, p = X.shape
    result = minimize(
        fun=objective_l2,
        x0=np.zeros(p),
        args=(X, y, lambda_reg),
        method='L-BFGS-B',
        jac=gradient_l2
    )
    return result.x

# Demo with regularization
np.random.seed(42)
n, p = 100, 50  # Many features relative to observations
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
theta_true = np.zeros(p)
theta_true[:5] = np.array([0.5, 1.0, -0.5, 0.8, -0.3])  # Sparse true model
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Compare different regularization strengths
for lam in [0.01, 0.1, 1.0, 10.0]:
    theta_est = fit_logistic_l2(X, y, lambda_reg=lam)
    nonzero = np.sum(np.abs(theta_est) > 0.01)
    print(f"λ={lam:5.2f}: ||θ||₂ = {np.linalg.norm(theta_est):.3f}, "
          f"active features = {nonzero}")
```

Unless you have strong theoretical reasons not to, always use regularization. It prevents numerical issues with separable data, improves generalization, and rarely hurts if $\lambda$ is chosen via cross-validation. sklearn's LogisticRegression uses L2 regularization by default.
How do we know when optimization has converged? Several criteria are used in practice:
1. Gradient Norm:
$$\|\nabla_\theta \ell\| < \epsilon_{\text{grad}}$$
The gradient should be near zero at the optimum. Typical threshold: $10^{-6}$ to $10^{-8}$.
2. Parameter Change:
$$\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \epsilon_{\theta}$$
Parameters should stabilize. Useful but can be misleading in flat regions.
3. Objective Change:
$$|\ell^{(t+1)} - \ell^{(t)}| < \epsilon_{\ell}$$
Or relative: $|\ell^{(t+1)} - \ell^{(t)}| / |\ell^{(t)}| < \epsilon$
4. Maximum Iterations:
Always include a maximum iteration count to prevent infinite loops.
Warning Signs:
- The log-likelihood decreases or oscillates between iterations (learning rate too large, or a bug in the gradient).
- Coefficients grow without bound while the gradient norm stays flat — the classic symptom of separable data, where the unregularized MLE does not exist.
- The gradient norm stalls well above the tolerance (ill-conditioned design, or a tolerance set tighter than the arithmetic precision allows).
Best Practices:
- Combine criteria: require a small gradient norm and always cap the iteration count (see the helper sketched after this list).
- Track the log-likelihood at every iteration; for batch methods it should increase monotonically.
- Standardize features and add regularization whenever separation is plausible.
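A minimal helper combining the criteria above might look like this; the function name and thresholds are illustrative defaults, not universal recommendations:

```python
import numpy as np

def converged(grad, theta_new, theta_old, ll_new, ll_old,
              eps_grad=1e-6, eps_theta=1e-8, eps_ll=1e-10):
    """Return True if any of the three convergence criteria is met.
    Call once per iteration inside the optimization loop, alongside a max_iter cap."""
    grad_small = np.linalg.norm(grad) < eps_grad
    theta_stable = np.linalg.norm(theta_new - theta_old) < eps_theta
    ll_stable = abs(ll_new - ll_old) < eps_ll * max(1.0, abs(ll_old))
    return grad_small or theta_stable or ll_stable
```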
A converged optimizer may still be wrong if the code has bugs, the data is corrupted, or the model is misspecified. Always sanity-check your results: Do coefficients have sensible signs and magnitudes? Does accuracy on held-out data match expectations? Do predictions cover the full range [0, 1]?
We've explored the major optimization approaches for logistic regression MLE. Each has its place in the practitioner's toolkit.
| Algorithm | Best For | Key Tradeoff |
|---|---|---|
| Gradient Descent | Learning, simple problems | Simple but slow |
| Newton-Raphson/IRLS | Small-medium $p$, high precision | Fast convergence, expensive iteration |
| L-BFGS | Medium-large $p$, general use | Good balance of speed and memory |
| SGD | Very large $n$, online learning | Scalable but noisy convergence |
| Coordinate Descent | L1 regularization, sparse models | Per-coordinate updates, feature selection |
Module Complete:
You have now mastered Maximum Likelihood Estimation for logistic regression, from theory to practice.
This knowledge extends far beyond logistic regression. The same principles—likelihood, gradient, iterative optimization—underpin neural networks, probabilistic graphical models, and much of modern machine learning.
Congratulations! You now have a complete, rigorous understanding of Maximum Likelihood Estimation for logistic regression. You can derive the likelihood and gradient from scratch, understand why iterative optimization is necessary, and implement multiple algorithms. This foundation prepares you for multi-class logistic regression, regularized methods, and the broader world of probabilistic machine learning.