In linear regression, we have the elegant closed-form solution:
$$\hat{\boldsymbol{\theta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
One matrix inversion, and we're done. The solution emerges directly from the normal equations without any iteration.
When approaching logistic regression, one might hope for a similar formula. After all, both models involve linear combinations of features, both pose convex optimization problems (for logistic regression, a concave log-likelihood that we maximize), and both are mathematically well behaved.
But there is no closed-form solution for logistic regression. The parameters cannot be expressed as a finite sequence of algebraic operations on the data. This is not a failure of mathematical ingenuity—it's a fundamental structural property of the problem.
Understanding why no closed-form exists is as important as knowing the iterative solutions that replace it. This understanding illuminates the boundary between tractable and intractable optimization, and explains why numerical methods are essential in modern machine learning.
By the end of this page, you will understand exactly why the logistic regression score equations cannot be solved analytically, appreciate the fundamental mathematical obstacles, contrast with linear regression to see where the difference lies, and understand the implications for algorithm design and computation.
Let's first understand what makes linear regression special. The key is that the objective function is quadratic in the parameters.
Linear Regression Setup:
$$\mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
Key Observation: This is a quadratic function of $\boldsymbol{\theta}$:
$$\mathcal{J}(\boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + \frac{1}{2}\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$$
Taking the Gradient:
$$\nabla_\theta \mathcal{J} = -\mathbf{X}^T\mathbf{y} + \mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$$
This is linear in $\boldsymbol{\theta}$!
Setting Gradient to Zero:
$$\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$$
This is a linear system of equations. Linear systems can be solved directly using matrix algebra:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
The recipe for closed-form solutions: (1) Objective is quadratic in parameters → (2) Gradient is linear in parameters → (3) Setting gradient to zero gives linear equations → (4) Linear equations have explicit solutions. Any break in this chain prevents closed forms.
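To make the contrast concrete before moving on, here is a minimal NumPy sketch (synthetic data; all names and sizes are illustrative) that obtains the OLS estimate in a single linear-algebra call, with no iteration, step size, or tolerance.

```python
import numpy as np

# Synthetic regression data (sizes and coefficients are illustrative)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Closed-form OLS: one linear solve of X^T X θ = X^T y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ols)  # close to theta_true, up to noise
```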
Now let's examine the logistic regression case and see where the chain breaks.
Logistic Regression Log-Likelihood:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T\mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T\mathbf{x}_i}) \right]$$
The Gradient (Score Function):
$$\nabla_\theta \ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$$
where $\boldsymbol{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$ and $\sigma$ is the sigmoid function.
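As a quick sanity check on this expression, the sketch below (synthetic data and a small finite-difference comparison, added here for illustration) verifies that $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ matches a numerical gradient of the log-likelihood defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    z = X @ theta
    # ℓ(θ) = Σ_i [ y_i (θᵀx_i) − log(1 + exp(θᵀx_i)) ], computed stably
    return np.sum(y * z - np.logaddexp(0.0, z))

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.3, 1.0, -0.7]))).astype(float)

theta = rng.normal(size=p)
analytic = X.T @ (y - sigmoid(X @ theta))  # ∇ℓ = Xᵀ(y − π)

eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y) - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(analytic - numeric)))  # small (≈1e-6 or below)
```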
The Score Equations (First-Order Conditions):
Setting the gradient to zero for a maximum:
$$\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi}) = \mathbf{0}$$
Or equivalently:
$$\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$$
The Problem:
This equation looks superficially similar to the linear regression normal equations. But there's a crucial difference: $\boldsymbol{\pi}$ depends on $\boldsymbol{\theta}$ nonlinearly!
$$\boldsymbol{\pi}(\boldsymbol{\theta}) = \frac{1}{1 + e^{-\mathbf{X}\boldsymbol{\theta}}}$$
The sigmoid function introduces exponentials that cannot be algebraically inverted.
| Property | Linear Regression | Logistic Regression |
|---|---|---|
| Score Equation | $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ | $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ |
| LHS Dependence on $\boldsymbol{\theta}$ | Linear | Nonlinear (sigmoid) |
| Equation Type | Linear system | Nonlinear system |
| Solution Method | Matrix inversion | Iterative optimization |
| Closed Form? | Yes ✓ | No ✗ |
If we could solve $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ for $\boldsymbol{\pi}$ and then invert the sigmoid, we would recover $\boldsymbol{\theta}$. But $\sigma^{-1}(\pi) = \log(\pi / (1-\pi))$ operates element-wise, while $\mathbf{X}\boldsymbol{\theta}$ couples all parameters to all observations through a single linear predictor. There is no way to isolate $\boldsymbol{\theta}$ algebraically.
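One might still be tempted to apply the inverse link directly to the observed labels and then fall back on linear regression. The sketch below (purely illustrative; the clipping constant is an arbitrary choice) shows why that shortcut fails.

```python
import numpy as np

y = np.array([0.0, 1.0, 1.0, 0.0])

# Applying σ⁻¹(π) = log(π / (1 − π)) element-wise to 0/1 labels is undefined
with np.errstate(divide="ignore"):
    naive = np.log(y / (1.0 - y))
print(naive)  # [-inf  inf  inf -inf]: unusable as a regression response

# Clipping the labels gives finite numbers, but regressing them on X is not
# maximum likelihood: the coupling of all parameters through Xθ has been lost
y_clipped = np.clip(y, 0.01, 0.99)
print(np.log(y_clipped / (1.0 - y_clipped)))
```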
Let's dig deeper into why the nonlinearity blocks algebraic solutions.
The Score Equation, Expanded:
Component $j$ of the score equation is:
$$\sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-\sum_{k=0}^{p} \theta_k x_{ik}}} \right) x_{ij} = 0$$
This is a transcendental equation in the parameters $\{\theta_0, \ldots, \theta_p\}$.
Why Transcendental Equations Are Hard:
A transcendental equation contains transcendental functions (exponentials, logarithms, trigonometric functions) that cannot be expressed using a finite number of algebraic operations.
Consider the simplest case: one observation, one parameter (no intercept):
$$y_1 - \frac{1}{1 + e^{-\theta_1 x_{11}}} = 0$$
Solving for $\theta_1$:
$$\frac{1}{1 + e^{-\theta_1 x_{11}}} = y_1$$
$$1 + e^{-\theta_1 x_{11}} = \frac{1}{y_1}$$
$$e^{-\theta_1 x_{11}} = \frac{1-y_1}{y_1}$$
If $y_1 = 1$: $e^{-\theta_1 x_{11}} = 0$, which requires $\theta_1 \to +\infty$ (if $x_{11} > 0$).
If $y_1 = 0$: $e^{-\theta_1 x_{11}} = \infty$, which requires $\theta_1 \to -\infty$ (if $x_{11} > 0$).
Even this trivial one-observation case has no finite solution: the exponential cannot be inverted to yield an exact, finite $\theta_1$ from a binary label.
Multiple Observations Create a Coupled System:
With $n > 1$ observations, we have $p+1$ equations:
$$\sum_{i=1}^{n} \sigma(\boldsymbol{\theta}^T\mathbf{x}_i)\, x_{ij} = \sum_{i=1}^{n} y_i x_{ij} \quad \text{for } j = 0, 1, \ldots, p$$
Each equation involves all parameters through the sigmoid. The parameters are coupled—you cannot solve for one without knowing the others.
Contrast with Linear Systems:
In linear regression, $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ can be decoupled via matrix factorization (LU, QR, Cholesky). The linearity allows us to express solution components independently.
In logistic regression, the nonlinearity means each equation involves transcendental functions of linear combinations of unknowns. No algebraic manipulation can separate them.
This isn't a matter of not having discovered the right trick yet. Abel and Galois showed that even general fifth-degree polynomials cannot be solved by radicals; systems of equations involving exponentials are further still from algebraic solvability. The absence of a closed form is the norm, not the exception, and the logistic score equations fall squarely in that territory.
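Treating the score equations as a generic root-finding problem makes the same point computationally. The sketch below (an assumed use of scipy.optimize.root on synthetic data, not a method prescribed by the text) can do nothing but hand the nonlinear system to an iterative solver.

```python
import numpy as np
from scipy.optimize import root

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta_true = np.array([0.5, 1.0, -0.5])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Score equations Xᵀ(y − σ(Xθ)) = 0: a nonlinear (transcendental) system in θ
score = lambda theta: X.T @ (y - sigmoid(X @ theta))

sol = root(score, x0=np.zeros(p))  # iterative under the hood; there is no formula to call
print(sol.success, sol.x)          # θ̂ roughly recovers theta_true for moderate n
```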
Another way to understand why iterative methods are necessary is through the fixed-point perspective.
Rewriting the Score Equations:
Consider trying to isolate $\boldsymbol{\theta}$ from $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$.
If we could invert, we'd want something like:
$$\boldsymbol{\theta} = g(\boldsymbol{\pi}(\boldsymbol{\theta}))$$
But $\boldsymbol{\pi}$ depends on $\boldsymbol{\theta}$, so we get:
$$\boldsymbol{\theta} = g(\boldsymbol{\pi}(\boldsymbol{\theta})) = f(\boldsymbol{\theta})$$
This is a fixed-point equation: we seek $\boldsymbol{\theta}$ such that applying $f$ returns $\boldsymbol{\theta}$ unchanged.
Fixed-Point Iteration:
The natural approach is iteration:
$$\boldsymbol{\theta}^{(t+1)} = f(\boldsymbol{\theta}^{(t)})$$
Starting from some $\boldsymbol{\theta}^{(0)}$, we repeatedly apply $f$ until convergence. This is exactly what iterative optimization does—each step moves closer to the fixed point (the MLE).
The Contraction Mapping Theorem:
If $f$ is a contraction (i.e., $\|f(\boldsymbol{\theta}) - f(\boldsymbol{\theta}')\| \leq L\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|$ for some $L < 1$), then $f$ has a unique fixed point, and the iteration converges to it from any starting point, with the error shrinking by at least a factor of $L$ at each step.
The sophisticated optimization methods we'll study (Newton-Raphson, IRLS) are designed to be fast contractions near the solution.
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))

def naive_fixed_point_iteration(X, y, max_iter=1000, tol=1e-6):
    """
    Attempt naive fixed-point iteration (for illustration).

    The score equation X^T π = X^T y suggests:
        π = (X^T)^{-1} X^T y   (if X^T were invertible)
    But π = σ(Xθ), and we can't invert σ directly on vectors.

    This demonstrates why simple iteration doesn't work.
    """
    n, p = X.shape
    theta = np.zeros(p)

    for iteration in range(max_iter):
        # Forward pass
        z = X @ theta
        pi = sigmoid(z)

        # Residual
        residual = y - pi

        # Gradient
        gradient = X.T @ residual

        # Naive update: θ += η * gradient (this IS gradient ascent!)
        # There's no "closed form iteration" without choosing a step size
        if np.linalg.norm(gradient) < tol:
            print(f"Converged at iteration {iteration}")
            break

        # We MUST choose a learning rate - this is optimization, not closed-form
        eta = 0.1
        theta = theta + eta * gradient

    return theta, gradient

# This demonstrates that even "fixed-point" thinking leads to iteration
np.random.seed(42)
n, p = 50, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

theta_est, final_grad = naive_fixed_point_iteration(X, y)
print(f"\nEstimated θ: {theta_est}")
print(f"True θ:      {theta_true}")
print(f"Final gradient norm: {np.linalg.norm(final_grad):.6f}")
print("\nNote: This is just gradient ascent in disguise!")
```

No matter how you formulate the problem—as fixed-point, as root-finding, as optimization—you end up with iteration. The question is not whether to iterate, but how to iterate efficiently. This motivates the algorithms in the next page.
While there's no closed-form in general, it's worth examining special cases to understand the structure better.
Case 1: Orthogonal Design with a Single Binary Feature
Consider $n$ observations with a single binary feature $x \in \{0, 1\}$ (plus intercept). Let $n_0$ and $n_1$ be the numbers of observations with $x = 0$ and $x = 1$, and let $y_{0+}$ and $y_{1+}$ be the corresponding counts of successes ($y_i = 1$) in each group.
The MLEs of the group probabilities can be computed directly:
$$\hat{\pi}_0 = \frac{y_{0+}}{n_0}, \quad \hat{\pi}_1 = \frac{y_{1+}}{n_1}$$
Then $\hat{\theta}_0 = \log\frac{\hat{\pi}_0}{1-\hat{\pi}_0}$ (the log-odds in group 0) and $\hat{\theta}_1 = \log\frac{\hat{\pi}_1}{1-\hat{\pi}_1} - \hat{\theta}_0$ (the log-odds ratio).
But this "closed form" is only for the probability space; the logit transformation is still needed.
Case 2: Perfect Separation
When one class can be perfectly separated from the other by a hyperplane:
$$\exists \boldsymbol{\theta}: \boldsymbol{\theta}^T\mathbf{x}_i > 0 \text{ for all } y_i = 1 \text{ and } \boldsymbol{\theta}^T\mathbf{x}_i < 0 \text{ for all } y_i = 0$$
In this case, the MLE does not exist—parameters diverge to $\pm\infty$. The likelihood can be made arbitrarily close to 1 by scaling $\boldsymbol{\theta}$.
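The tiny separable dataset below (chosen purely for illustration) makes the divergence visible: scaling a separating direction by ever larger constants keeps pushing the log-likelihood toward 0, so no finite maximizer exists.

```python
import numpy as np

# Perfectly separated 1-D data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])

def log_likelihood(theta):
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))  # stable log(1 + e^z)

direction = np.array([0.0, 1.0])  # θᵀx_i > 0 exactly when y_i = 1
for c in [1, 10, 100, 1000]:
    print(c, log_likelihood(c * direction))  # climbs toward 0, i.e. likelihood → 1
```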
Case 3: Quasi-Complete Separation
A weaker condition than perfect separation: a hyperplane separates the classes except for observations lying exactly on it. The log-likelihood is still bounded above but attains no finite maximum; the coefficients along the separating direction diverge while others may remain finite.
| Case | Description | MLE Behavior |
|---|---|---|
| General Case | Overlapping classes, no exact formula | Finite MLE exists, found by iteration |
| Orthogonal/Simple | Special structure allows simplification | Can compute closed-form for probabilities, then transform |
| Perfect Separation | A hyperplane separates classes exactly | MLE doesn't exist; parameters → ±∞ |
| Quasi-Complete Separation | Hyperplane separates the classes except for points lying on it | Some coefficients → ±∞; others finite |
In high-dimensional settings (p > n), perfect separation is nearly guaranteed. This is one motivation for regularization: L2 regularization ensures a unique, finite MLE exists even with separation. Regularization isn't just about generalization—it's about having a well-defined optimization problem.
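Reusing the same perfectly separated toy data as in the earlier sketch, a minimal example (plain gradient ascent with an assumed penalty strength; not a production solver) shows that an L2 penalty restores a finite, unique maximizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same perfectly separated toy data as above
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])

def fit(lam, steps=20000, eta=0.1):
    """Gradient ascent on ℓ(θ) − (λ/2)‖θ‖²; λ = 0 attempts the unpenalized MLE."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta = theta + eta * (X.T @ (y - sigmoid(X @ theta)) - lam * theta)
    return theta

print("unpenalized:", fit(lam=0.0))  # keeps growing if you add more steps (no finite MLE)
print("ridge, λ=1: ", fit(lam=1.0))  # settles at a finite, unique maximizer
```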
Given the lack of exact solutions, researchers have explored approximate closed-form solutions. These are useful for understanding and for initialization.
Taylor Expansion Approach:
Linearizing the sigmoid around some point $\boldsymbol{\theta}_0$:
$$\sigma(z) \approx \sigma(z_0) + \sigma'(z_0)(z - z_0)$$
This leads to the Iteratively Reweighted Least Squares (IRLS) update, which we'll study in the next page. Each iteration solves a weighted linear regression—"almost" closed-form, repeated.
Working Response Linearization:
At each iteration, define "working responses" and "working weights" that convert the problem to weighted least squares:
$$\tilde{y}_i = z_i + \frac{y_i - \pi_i}{\pi_i(1-\pi_i)}, \qquad w_i = \pi_i(1-\pi_i)$$
where $z_i = \mathbf{x}_i^T\boldsymbol{\theta}^{\text{old}}$ and $\pi_i = \sigma(z_i)$ are evaluated at the current iterate.
Then the update is:
$$\boldsymbol{\theta}^{\text{new}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\tilde{\mathbf{y}}$$
This looks like weighted least squares! But $\mathbf{W}$ and $\tilde{\mathbf{y}}$ depend on $\boldsymbol{\theta}^{\text{old}}$, so iteration is still required.
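Here is a compact sketch of one such update on synthetic data (the small floor on the weights is an implementation choice of this sketch, not part of the formula): compute $\boldsymbol{\pi}$ and $\mathbf{W}$ at the current $\boldsymbol{\theta}$, form the working response, and solve the weighted least-squares system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(theta_old, X, y, eps=1e-10):
    z = X @ theta_old
    pi = sigmoid(z)
    w = np.clip(pi * (1.0 - pi), eps, None)  # working weights w_i = π_i(1 − π_i)
    y_work = z + (y - pi) / w                # working response ỹ_i
    W = np.diag(w)
    # One weighted least-squares solve: θ_new = (XᵀWX)⁻¹ XᵀW ỹ
    # (W and ỹ depend on θ_old, which is why the step must be repeated)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y_work)

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

theta = np.zeros(p)
for t in range(8):  # a handful of IRLS steps is typically enough on data like this
    theta = irls_step(theta, X, y)
print(theta)
```

A single call to `irls_step` from a sufficiently good starting value is, in this canonical-link case, exactly the one-step Newton estimator discussed next.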
One-Step Estimators:
For inference purposes, sometimes a single Newton or IRLS step from a good initial estimate is "good enough":
$$\hat{\boldsymbol{\theta}}_{\text{one-step}} = \boldsymbol{\theta}^{(0)} - \mathbf{H}(\boldsymbol{\theta}^{(0)})^{-1} \nabla \ell(\boldsymbol{\theta}^{(0)})$$
If $\boldsymbol{\theta}^{(0)}$ is $\sqrt{n}$-consistent, the one-step estimator is asymptotically efficient. But you still need iteration to get a good $\boldsymbol{\theta}^{(0)}$.
IRLS is the bridge between 'no closed form' and 'this is manageable.' Each step solves a weighted linear system (a well-understood problem with efficient algorithms). The weights and responses change between iterations, but within each step, we're doing familiar linear algebra. This is why logistic regression is computationally tractable despite lacking a closed-form solution.
The lack of a closed-form solution has profound implications for how we design and implement logistic regression.
Initialization Matters:
Without a formula that gives the answer directly, we must start somewhere. The choice of initial $\boldsymbol{\theta}^{(0)}$ affects how many iterations are needed and how numerically stable the early steps are (for non-convex problems, it would also determine which solution is found).
For logistic regression (convex), any starting point works, but smart initialization (e.g., from a linear regression on log-odds) can save iterations.
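One way to realize the "linear regression on log-odds" idea is sketched below. The label adjustment $0.25 + 0.5\,y$ is an arbitrary illustrative choice to keep the logit finite, not a standard prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

# Shrink the 0/1 labels away from the boundary so the logit is finite,
# then fit ordinary least squares to the pseudo log-odds
y_adj = 0.25 + 0.5 * y  # maps {0, 1} to {0.25, 0.75}
pseudo_logit = np.log(y_adj / (1.0 - y_adj))
theta_init = np.linalg.lstsq(X, pseudo_logit, rcond=None)[0]
print(theta_init)  # a rough but finite warm start for the iterative solver
```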
Convergence Criteria:
We must decide when to stop iterating. Common criteria include: the gradient norm falling below a tolerance, the change in the parameters falling below a tolerance, the change in the log-likelihood falling below a tolerance, or a maximum iteration count being reached.
Each criterion has tradeoffs; combining multiple is common practice.
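A minimal sketch of a combined stopping rule (the tolerance values are placeholders, not recommendations from the text):

```python
import numpy as np

def converged(grad, theta_new, theta_old, ll_new, ll_old,
              grad_tol=1e-6, param_tol=1e-8, ll_tol=1e-10):
    """Report which of the usual stopping criteria are currently satisfied."""
    checks = {
        "gradient norm": np.linalg.norm(grad) < grad_tol,
        "parameter change": np.linalg.norm(theta_new - theta_old) < param_tol,
        "log-likelihood change": abs(ll_new - ll_old) < ll_tol,
    }
    # Requiring all checks is conservative; many implementations stop on any one
    return all(checks.values()), checks
```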
Algorithm Selection:
Different algorithms have different convergence properties: gradient descent takes cheap steps but may need many of them, while Newton-Raphson and IRLS converge rapidly near the optimum at the cost of solving a $(p+1) \times (p+1)$ linear system at every step. The next page compares them in detail.
Practical Considerations:
In production systems, the lack of closed-form means: training time depends on the data, the starting point, and the tolerances; different libraries or convergence settings can return slightly different coefficients; and convergence must be monitored and reported rather than assumed.
Without convexity, the lack of closed-form would be far more problematic—we'd face local minima, saddle points, and sensitivity to initialization. Because logistic regression is convex, we can confidently state: 'iterate until gradient ≈ 0, and you have the global optimum.' This guarantee is precious and shouldn't be taken for granted.
We've established a fundamental fact about logistic regression: the MLE cannot be computed in closed form. This isn't a failure of technique—it's a structural property of the problem.
What's Next:
We've established that iteration is necessary. The final page of this module develops the major optimization approaches for logistic regression: gradient descent, Newton-Raphson, and IRLS. We'll see how the concavity we proved earlier, combined with the gradient we derived, enables efficient algorithms that reliably find the global maximum.
You now understand why logistic regression lacks a closed-form solution, can trace the mathematical obstacles to the sigmoid nonlinearity, and appreciate why iterative optimization is not just convenient but necessary. Next, we'll master the algorithms that make training efficient.