In linear regression, we have the elegant closed-form solution:
$$\hat{\boldsymbol{\theta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
One matrix inversion, and we're done. The solution emerges directly from the normal equations without any iteration.
When approaching logistic regression, one might hope for a similar formula. After all, both models involve linear combinations of features, both pose convex optimization problems (for logistic regression, a concave log-likelihood that we maximize), and both are mathematically well behaved.
But there is no closed-form solution for logistic regression. The parameters cannot be expressed as a finite sequence of algebraic operations on the data. This is not a failure of mathematical ingenuity—it's a fundamental structural property of the problem.
Understanding why no closed-form exists is as important as knowing the iterative solutions that replace it. This understanding illuminates the boundary between tractable and intractable optimization, and explains why numerical methods are essential in modern machine learning.
By the end of this page, you will understand exactly why the logistic regression score equations cannot be solved analytically, appreciate the fundamental mathematical obstacles, contrast with linear regression to see where the difference lies, and understand the implications for algorithm design and computation.
Let's first understand what makes linear regression special. The key is that the objective function is quadratic in the parameters.
Linear Regression Setup:
$$\mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
Key Observation: This is a quadratic function of $\boldsymbol{\theta}$:
$$\mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + \frac{1}{2}\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$$
Taking the Gradient:
$$\nabla_\theta \mathcal{J} = -\mathbf{X}^T\mathbf{y} + \mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$$
This is linear in $\boldsymbol{\theta}$!
Setting Gradient to Zero:
$$\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$$
This is a linear system of equations. Linear systems can be solved directly using matrix algebra:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
The recipe for closed-form solutions: (1) Objective is quadratic in parameters → (2) Gradient is linear in parameters → (3) Setting gradient to zero gives linear equations → (4) Linear equations have explicit solutions. Any break in this chain prevents closed forms.
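To make the contrast concrete before moving on, here is a minimal NumPy sketch (synthetic data; all names and sizes are illustrative) that obtains the OLS estimate in a single linear-algebra call, with no iteration, step size, or tolerance.

```python
import numpy as np

# Synthetic regression data (sizes and coefficients are illustrative)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Closed-form OLS: one linear solve of X^T X θ = X^T y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ols)  # close to theta_true, up to noise
```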
Now let's examine the logistic regression case and see where the chain breaks.
Logistic Regression Log-Likelihood:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T\mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T\mathbf{x}_i}) \right]$$
The Gradient (Score Function):
$$\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$$
where $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$ and $\sigma$ is the sigmoid function.
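As a quick sanity check on this expression, the sketch below (synthetic data and a small finite-difference comparison, added here for illustration) verifies that $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ matches a numerical gradient of the log-likelihood defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    z = X @ theta
    # ℓ(θ) = Σ_i [ y_i (θᵀx_i) − log(1 + exp(θᵀx_i)) ], computed stably
    return np.sum(y * z - np.logaddexp(0.0, z))

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.3, 1.0, -0.7]))).astype(float)

theta = rng.normal(size=p)
analytic = X.T @ (y - sigmoid(X @ theta))  # ∇ℓ = Xᵀ(y − π)

eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y) - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(analytic - numeric)))  # small (≈1e-6 or below)
```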
The Score Equations (First-Order Conditions):
Setting the gradient to zero for a maximum:
$$\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}) = \mathbf{0}$$
Or equivalently:
$$\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$$
The Problem:
This equation looks superficially similar to the linear regression normal equations. But there's a crucial difference: $\boldsymbol{\pi}$ depends on $\boldsymbol{\theta}$ nonlinearly!
$$\boldsymbol{\pi}(\boldsymbol{\theta}) = \frac{1}{1 + e^{-\mathbf{X}\boldsymbol{\theta}}}$$
The sigmoid function introduces exponentials that cannot be algebraically inverted.
| Property | Linear Regression | Logistic Regression |
|---|---|---|
| Score Equation | $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ | $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ |
| LHS Dependence on $\boldsymbol{\theta}$ | Linear | Nonlinear (sigmoid) |
| Equation Type | Linear system | Nonlinear system |
| Solution Method | Matrix inversion | Iterative optimization |
| Closed Form? | Yes ✓ | No ✗ |
If we could solve $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ for $\boldsymbol{\pi}$ and then invert the sigmoid, we would recover $\boldsymbol{\theta}$. But $\sigma^{-1}(\pi) = \log(\pi / (1-\pi))$ operates element-wise, while $\mathbf{X}\boldsymbol{\theta}$ couples all parameters to all observations through a single linear predictor. There is no way to isolate $\boldsymbol{\theta}$ algebraically.
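One might still be tempted to apply the inverse link directly to the observed labels and then fall back on linear regression. The sketch below (purely illustrative; the clipping constant is an arbitrary choice) shows why that shortcut fails.

```python
import numpy as np

y = np.array([0.0, 1.0, 1.0, 0.0])

# Applying σ⁻¹(π) = log(π / (1 − π)) element-wise to 0/1 labels is undefined
with np.errstate(divide="ignore"):
    naive = np.log(y / (1.0 - y))
print(naive)  # [-inf  inf  inf -inf]: unusable as a regression response

# Clipping the labels gives finite numbers, but regressing them on X is not
# maximum likelihood: the coupling of all parameters through Xθ has been lost
y_clipped = np.clip(y, 0.01, 0.99)
print(np.log(y_clipped / (1.0 - y_clipped)))
```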
Let's dig deeper into why the nonlinearity blocks algebraic solutions.
The Score Equation, Expanded:
Component $j$ of the score equation is:
$$\sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-\sum_{k=0}^{p} \theta_k x_{ik}}} \right) x_{ij} = 0$$
This is a transcendental equation in the parameters $\{\theta_0, \ldots, \theta_p\}$.
Why Transcendental Equations Are Hard:
A transcendental equation contains transcendental functions (exponentials, logarithms, trigonometric functions) that cannot be expressed using a finite number of algebraic operations.
Consider the simplest case: one observation, one parameter (no intercept):
$$y_1 - \frac{1}{1 + e^{-\theta_1 x_{11}}} = 0$$
Solving for $\theta_1$:
$$\frac{1}{1 + e^{-\theta_1 x_{11}}} = y_1$$
$$1 + e^{-\theta_1 x_{11}} = \frac{1}{y_1}$$
$$e^{-\theta_1 x_{11}} = \frac{1-y_1}{y_1}$$
If $y_1 = 1$: $e^{-\theta_1 x_{11}} = 0$, which requires $\theta_1 \to +\infty$ (if $x_{11} > 0$).
If $y_1 = 0$: $e^{-\theta_1 x_{11}} = \infty$, which requires $\theta_1 \to -\infty$ (if $x_{11} > 0$).
Even this trivial one-observation case has no finite solution: the exponential cannot be inverted to yield an exact, finite $\theta_1$ from a binary label.
Multiple Observations Create a Coupled System:
With $n > 1$ observations, we have $p+1$ equations:
$$\sum_{i=1}^{n} \sigma(\boldsymbol{\theta}^T\mathbf{x}_i)\, x_{ij} = \sum_{i=1}^{n} y_i x_{ij} \quad \text{for } j = 0, 1, \ldots, p$$
Each equation involves all parameters through the sigmoid. The parameters are coupled—you cannot solve for one without knowing the others.
Contrast with Linear Systems:
In linear regression, $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ can be decoupled via matrix factorization (LU, QR, Cholesky). The linearity allows us to express solution components independently.
In logistic regression, the nonlinearity means each equation involves transcendental functions of linear combinations of unknowns. No algebraic manipulation can separate them.
This isn't a matter of not having discovered the right trick yet. Abel and Galois showed that even general fifth-degree polynomials cannot be solved by radicals; systems of equations involving exponentials are further still from algebraic solvability. The absence of a closed form is the norm, not the exception, and the logistic score equations fall squarely in that territory.
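Treating the score equations as a generic root-finding problem makes the same point computationally. The sketch below (an assumed use of scipy.optimize.root on synthetic data, not a method prescribed by the text) can do nothing but hand the nonlinear system to an iterative solver.

```python
import numpy as np
from scipy.optimize import root

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta_true = np.array([0.5, 1.0, -0.5])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Score equations Xᵀ(y − σ(Xθ)) = 0: a nonlinear (transcendental) system in θ
score = lambda theta: X.T @ (y - sigmoid(X @ theta))

sol = root(score, x0=np.zeros(p))  # iterative under the hood; there is no formula to call
print(sol.success, sol.x)          # θ̂ roughly recovers theta_true for moderate n
```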
Another way to understand why iterative methods are necessary is through the fixed-point perspective.
Rewriting the Score Equations:
Consider trying to isolate $\boldsymbol{\theta}$ from $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$.
If we could invert, we'd want something like:
$$\boldsymbol{\theta} = g(\boldsymbol{\pi}(\boldsymbol{\theta}))$$
But $\boldsymbol{\pi}$ depends on $\boldsymbol{\theta}$, so we get:
$$\boldsymbol{\theta} = g(\boldsymbol{\pi}(\boldsymbol{\theta})) = f(\boldsymbol{\theta})$$
This is a fixed-point equation: we seek $\boldsymbol{\theta}$ such that applying $f$ returns $\boldsymbol{\theta}$ unchanged.
Fixed-Point Iteration:
The natural approach is iteration:
$$\boldsymbol{\theta}^{(t+1)} = f(\boldsymbol{\theta}^{(t)})$$
Starting from some $\boldsymbol{\theta}^{(0)}$, we repeatedly apply $f$ until convergence. This is exactly what iterative optimization does—each step moves closer to the fixed point (the MLE).
The Contraction Mapping Theorem:
If $f$ is a contraction (i.e., $\|f(\boldsymbol{\theta}) - f(\boldsymbol{\theta}')\| \leq L\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|$ for some $L < 1$), then $f$ has a unique fixed point, and the iteration converges to it from any starting point, with the error shrinking by at least a factor of $L$ at each step.
The sophisticated optimization methods we'll study (Newton-Raphson, IRLS) are designed to be fast contractions near the solution.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def naive_fixed_point_iteration(X, y, max_iter=1000, tol=1e-6):
    """
    Attempt naive fixed-point iteration (for illustration).

    The score equation X^T π = X^T y suggests:
        π = (X^T)^{-1} X^T y   (if X^T were invertible)
    But π = σ(Xθ), and we can't invert σ directly on vectors.

    This demonstrates why simple iteration doesn't work.
    """
    n, p = X.shape
    theta = np.zeros(p)

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Residual
        residual = y - pi

        # Gradient
        gradient = X.T @ residual

        # Naive update: θ += η * gradient (this IS gradient ascent!)
        # There's no "closed form iteration" without choosing a step size
        if np.linalg.norm(gradient) < tol:
            print(f"Converged at iteration {iteration}")
            break

        # We MUST choose a learning rate - this is optimization, not closed-form
        eta = 0.1
        theta = theta + eta * gradient

    return theta, gradient

# This demonstrates that even "fixed-point" thinking leads to iteration
np.random.seed(42)
n, p = 50, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_est, final_grad = naive_fixed_point_iteration(X, y)
print(f"\nEstimated θ: {theta_est}")
print(f"True θ:      {theta_true}")
print(f"Final gradient norm: {np.linalg.norm(final_grad):.6f}")
print("\nNote: This is just gradient ascent in disguise!")
```

No matter how you formulate the problem—as fixed-point, as root-finding, as optimization—you end up with iteration. The question is not whether to iterate, but how to iterate efficiently. This motivates the algorithms in the next page.
While there's no closed-form in general, it's worth examining special cases to understand the structure better.
Case 1: Orthogonal Design with a Single Binary Feature
Consider $n$ observations with a single binary feature $x \in \{0, 1\}$ (plus intercept). Let $n_0$ and $n_1$ be the numbers of observations with $x = 0$ and $x = 1$, and let $y_{0+}$ and $y_{1+}$ be the corresponding counts of successes ($y_i = 1$) in each group.
The MLEs of the group probabilities can be computed directly:
$$\hat{\pi}_0 = \frac{y_{0+}}{n_0}, \quad \hat{\pi}_1 = \frac{y_{1+}}{n_1}$$
Then $\hat{\theta}_0 = \log\frac{\hat{\pi}_0}{1-\hat{\pi}_0}$ (the log-odds in group 0) and $\hat{\theta}_1 = \log\frac{\hat{\pi}_1}{1-\hat{\pi}_1} - \hat{\theta}_0$ (the log-odds ratio).
But this "closed form" is only for the probability space; the logit transformation is still needed.
Case 2: Perfect Separation
When one class can be perfectly separated from the other by a hyperplane:
$$\exists \boldsymbol{\theta}: \boldsymbol{\theta}^T\mathbf{x}_i > 0 \text{ for all } y_i = 1 \text{ and } \boldsymbol{\theta}^T\mathbf{x}_i < 0 \text{ for all } y_i = 0$$
In this case, the MLE does not exist—parameters diverge to $\pm\infty$. The likelihood can be made arbitrarily close to 1 by scaling $\boldsymbol{\theta}$.
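The tiny separable dataset below (chosen purely for illustration) makes the divergence visible: scaling a separating direction by ever larger constants keeps pushing the log-likelihood toward 0, so no finite maximizer exists.

```python
import numpy as np

# Perfectly separated 1-D data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])

def log_likelihood(theta):
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))  # stable log(1 + e^z)

direction = np.array([0.0, 1.0])  # θᵀx_i > 0 exactly when y_i = 1
for c in [1, 10, 100, 1000]:
    print(c, log_likelihood(c * direction))  # climbs toward 0, i.e. likelihood → 1
```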
Case 3: Quasi-Complete Separation
A weaker condition than perfect separation: a hyperplane separates the classes except for observations lying exactly on it. The log-likelihood is still bounded above but attains no finite maximum; the coefficients along the separating direction diverge while others may remain finite.
| Case | Description | MLE Behavior |
|---|---|---|
| General Case | Overlapping classes, no exact formula | Finite MLE exists, found by iteration |
| Orthogonal/Simple | Special structure allows simplification | Can compute closed-form for probabilities, then transform |
| Perfect Separation | A hyperplane separates classes exactly | MLE doesn't exist; parameters → ±∞ |
| Quasi-Complete Separation | Hyperplane separates the classes except for points lying on it | Some coefficients → ±∞; others finite |
In high-dimensional settings (p > n), perfect separation is nearly guaranteed. This is one motivation for regularization: L2 regularization ensures a unique, finite MLE exists even with separation. Regularization isn't just about generalization—it's about having a well-defined optimization problem.
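Reusing the same perfectly separated toy data as in the earlier sketch, a minimal example (plain gradient ascent with an assumed penalty strength; not a production solver) shows that an L2 penalty restores a finite, unique maximizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same perfectly separated toy data as above
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])

def fit(lam, steps=20000, eta=0.1):
    """Gradient ascent on ℓ(θ) − (λ/2)‖θ‖²; λ = 0 attempts the unpenalized MLE."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta = theta + eta * (X.T @ (y - sigmoid(X @ theta)) - lam * theta)
    return theta

print("unpenalized:", fit(lam=0.0))  # keeps growing if you add more steps (no finite MLE)
print("ridge, λ=1: ", fit(lam=1.0))  # settles at a finite, unique maximizer
```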
Given the lack of exact solutions, researchers have explored approximate closed-form solutions. These are useful for understanding and for initialization.
Taylor Expansion Approach:
Linearizing the sigmoid around some point $\boldsymbol{\theta}_0$:
$$\sigma(z) \approx \sigma(z_0) + \sigma'(z_0)(z - z_0)$$
This leads to the Iteratively Reweighted Least Squares (IRLS) update, which we'll study in the next page. Each iteration solves a weighted linear regression—"almost" closed-form, repeated.
Working Response Linearization:
At each iteration, define "working responses" and "working weights" that convert the problem to weighted least squares:
$$\tilde{y}_i = z_i + \frac{y_i - \pi_i}{\pi_i(1-\pi_i)}, \qquad w_i = \pi_i(1-\pi_i)$$
where $z_i = \mathbf{x}_i^T\boldsymbol{\theta}^{\text{old}}$ and $\pi_i = \sigma(z_i)$ are evaluated at the current iterate.
Then the update is:
$$\boldsymbol{\theta}^{\text{new}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\tilde{\mathbf{y}}$$
This looks like weighted least squares! But $\mathbf{W}$ and $\tilde{\mathbf{y}}$ depend on $\boldsymbol{\theta}^{\text{old}}$, so iteration is still required.
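Here is a compact sketch of one such update on synthetic data (the small floor on the weights is an implementation choice of this sketch, not part of the formula): compute $\boldsymbol{\pi}$ and $\mathbf{W}$ at the current $\boldsymbol{\theta}$, form the working response, and solve the weighted least-squares system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(theta_old, X, y, eps=1e-10):
    z = X @ theta_old
    pi = sigmoid(z)
    w = np.clip(pi * (1.0 - pi), eps, None)  # working weights w_i = π_i(1 − π_i)
    y_work = z + (y - pi) / w                # working response ỹ_i
    W = np.diag(w)
    # One weighted least-squares solve: θ_new = (XᵀWX)⁻¹ XᵀW ỹ
    # (W and ỹ depend on θ_old, which is why the step must be repeated)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y_work)

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

theta = np.zeros(p)
for t in range(8):  # a handful of IRLS steps is typically enough on data like this
    theta = irls_step(theta, X, y)
print(theta)
```

A single call to `irls_step` from a sufficiently good starting value is, in this canonical-link case, exactly the one-step Newton estimator discussed next.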
One-Step Estimators:
For inference purposes, sometimes a single Newton or IRLS step from a good initial estimate is "good enough":
$$\hat{\boldsymbol{\theta}}_{\text{one-step}} = \boldsymbol{\theta}^{(0)} - \mathbf{H}(\boldsymbol{\theta}^{(0)})^{-1} \nabla \ell(\boldsymbol{\theta}^{(0)})$$
If $\boldsymbol{\theta}^{(0)}$ is $\sqrt{n}$-consistent, the one-step estimator is asymptotically efficient. But you still need iteration to get a good $\boldsymbol{\theta}^{(0)}$.
IRLS is the bridge between 'no closed form' and 'this is manageable.' Each step solves a weighted linear system (a well-understood problem with efficient algorithms). The weights and responses change between iterations, but within each step, we're doing familiar linear algebra. This is why logistic regression is computationally tractable despite lacking a closed-form solution.
The lack of a closed-form solution has profound implications for how we design and implement logistic regression.
Initialization Matters:
Without a formula that gives the answer directly, we must start somewhere. The choice of initial $\boldsymbol{\theta}^{(0)}$ affects how many iterations are needed and how numerically stable the early steps are (for non-convex problems, it would also determine which solution is found).
For logistic regression (convex), any starting point works, but smart initialization (e.g., from a linear regression on log-odds) can save iterations.
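One way to realize the "linear regression on log-odds" idea is sketched below. The label adjustment $0.25 + 0.5\,y$ is an arbitrary illustrative choice to keep the logit finite, not a standard prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

# Shrink the 0/1 labels away from the boundary so the logit is finite,
# then fit ordinary least squares to the pseudo log-odds
y_adj = 0.25 + 0.5 * y  # maps {0, 1} to {0.25, 0.75}
pseudo_logit = np.log(y_adj / (1.0 - y_adj))
theta_init = np.linalg.lstsq(X, pseudo_logit, rcond=None)[0]
print(theta_init)  # a rough but finite warm start for the iterative solver
```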
Convergence Criteria:
We must decide when to stop iterating. Common criteria include: the gradient norm falling below a tolerance, the change in the parameters falling below a tolerance, the change in the log-likelihood falling below a tolerance, or a maximum iteration count being reached.
Each criterion has tradeoffs; combining multiple is common practice.
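A minimal sketch of a combined stopping rule (the tolerance values are placeholders, not recommendations from the text):

```python
import numpy as np

def converged(grad, theta_new, theta_old, ll_new, ll_old,
              grad_tol=1e-6, param_tol=1e-8, ll_tol=1e-10):
    """Report which of the usual stopping criteria are currently satisfied."""
    checks = {
        "gradient norm": np.linalg.norm(grad) < grad_tol,
        "parameter change": np.linalg.norm(theta_new - theta_old) < param_tol,
        "log-likelihood change": abs(ll_new - ll_old) < ll_tol,
    }
    # Requiring all checks is conservative; many implementations stop on any one
    return all(checks.values()), checks
```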
Algorithm Selection:
Different algorithms have different convergence properties: gradient descent takes cheap steps but may need many of them, while Newton-Raphson and IRLS converge rapidly near the optimum at the cost of solving a $(p+1) \times (p+1)$ linear system at every step. The next page compares them in detail.
Practical Considerations:
In production systems, the lack of closed-form means: training time depends on the data, the starting point, and the tolerances; different libraries or convergence settings can return slightly different coefficients; and convergence must be monitored and reported rather than assumed.
Without convexity, the lack of closed-form would be far more problematic—we'd face local minima, saddle points, and sensitivity to initialization. Because logistic regression is convex, we can confidently state: 'iterate until gradient ≈ 0, and you have the global optimum.' This guarantee is precious and shouldn't be taken for granted.
We've established a fundamental fact about logistic regression: the MLE cannot be computed in closed form. This isn't a failure of technique—it's a structural property of the problem.
What's Next:
We've established that iteration is necessary. The final page of this module develops the major optimization approaches for logistic regression: gradient descent, Newton-Raphson, and IRLS. We'll see how the concavity we proved earlier, combined with the gradient we derived, enables efficient algorithms that reliably find the global maximum.
You now understand why logistic regression lacks a closed-form solution, can trace the mathematical obstacles to the sigmoid nonlinearity, and appreciate why iterative optimization is not just convenient but necessary. Next, we'll master the algorithms that make training efficient.