In the previous page, we constructed the likelihood function for logistic regression:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
This expression is mathematically elegant but computationally treacherous. When you multiply hundreds or thousands of probabilities—each between 0 and 1—the result rapidly approaches zero. A typical dataset might yield a likelihood of $10^{-500}$ or smaller, far below the smallest positive number representable in double precision ($\approx 10^{-308}$).
But the solution is beautifully simple: take logarithms. The logarithm transforms products into sums, preserves the location of maxima, and converts tiny positive numbers into manageable negative numbers. This single transformation makes maximum likelihood estimation computationally feasible and mathematically elegant.
More profoundly, the log-likelihood reveals structural properties hidden in the likelihood—most importantly, concavity. This concavity guarantees that optimization algorithms will find the global maximum, not get stuck in local optima.
By the end of this page, you will understand why the log-likelihood is essential for computation, derive the log-likelihood function for logistic regression step by step, prove its concavity rigorously, and appreciate how this concavity enables efficient optimization. These insights extend to other generalized linear models, and the cross-entropy objective itself reappears in neural network training (though without the concavity guarantee).
The logarithm is one of the most powerful tools in applied mathematics. For maximum likelihood estimation, it provides multiple critical benefits:
1. Numerical Stability
Multiplying many probabilities causes numerical underflow: for example, $0.5^{1000} \approx 10^{-301}$ sits at the very edge of double precision, and any smaller product rounds to exactly zero.
The log-likelihood stays manageable: the same quantity on the log scale is $1000 \log 0.5 \approx -693$, a perfectly ordinary number (see the short demo after this list).
2. Mathematical Simplification
Products become sums: $\log \prod_i f_i = \sum_i \log f_i$
Sums are much easier to:
- differentiate (the derivative of a sum is the sum of the derivatives of its terms),
- compute and parallelize across observations,
- analyze statistically (sums of independent terms obey laws of large numbers and central limit theorems).
3. Preservation of Optima
Since $\log$ is a strictly increasing function, it preserves the location of maxima and minima:
$$\arg\max_\theta \mathcal{L}(\theta) = \arg\max_\theta \log \mathcal{L}(\theta)$$
Maximizing the log-likelihood gives the same parameter estimates as maximizing the likelihood itself.
Any strictly monotonic transformation preserves the location of extrema. We could use $\sqrt{\mathcal{L}}$ or $\mathcal{L}^{0.1}$ instead of $\log \mathcal{L}$—they all yield the same MLE. We choose logarithm because it also transforms products to sums, making derivatives analytically tractable and connecting to information-theoretic quantities.
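The contrast is easy to see numerically. Below is a minimal sketch (the probabilities are arbitrary simulated values, not from any particular model): the raw product underflows to zero, while the sum of logs is an ordinary negative number, and the strictly increasing log preserves every ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation probabilities of the observed outcomes
probs = rng.uniform(0.1, 0.9, size=1000)

# Naive likelihood: the product underflows to exactly 0.0 in double precision
likelihood = np.prod(probs)

# Log-likelihood: the sum of logs is a large but representable negative number
log_likelihood = np.sum(np.log(probs))

print(f"Product of probabilities: {likelihood}")
print(f"Sum of log-probabilities: {log_likelihood:.1f}")

# Because log is strictly increasing, it preserves orderings (and hence the argmax)
a, b = 1e-300, 2e-300
print((a < b) == (np.log(a) < np.log(b)))  # True
```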
Let us carefully derive the log-likelihood function for logistic regression, step by step.
Starting Point — The Likelihood:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
where $\pi_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}_i}}$
Step 1: Apply Logarithm
$$\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \log \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
Step 2: Convert Product to Sum
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \left[ \pi_i^{y_i} (1 - \pi_i)^{1-y_i} \right]$$
Step 3: Apply Logarithm Rules
Using $\log(ab) = \log a + \log b$ and $\log a^b = b \log a$:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$$
Up to a sign, this is exactly the binary cross-entropy between the true labels $y_i$ and the predicted probabilities $\pi_i$, summed over all observations. Cross-entropy loss in machine learning is simply the negative log-likelihood!
Step 4: Substitute the Sigmoid
Now we substitute $\pi_i = \sigma(z_i)$ where $z_i = \boldsymbol{\theta}^T \mathbf{x}_i$:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i)) \right]$$
Step 5: Use Sigmoid Properties
Recall the key sigmoid properties:
$$\log \sigma(z) = -\log(1 + e^{-z}), \qquad \log(1 - \sigma(z)) = \log\frac{e^{-z}}{1 + e^{-z}} = -z - \log(1 + e^{-z})$$
Substituting:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ -y_i \log(1 + e^{-z_i}) + (1-y_i)\left(-z_i - \log(1 + e^{-z_i})\right) \right]$$
Step 6: Simplify
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ -y_i \log(1 + e^{-z_i}) - z_i + y_i z_i - \log(1 + e^{-z_i}) + y_i \log(1 + e^{-z_i}) \right]$$
Collecting terms:
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i z_i - z_i - \log(1 + e^{-z_i}) \right]$$
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i z_i - \log(1 + e^{z_i}) \right]$$
where in the last step we used $z_i + \log(1 + e^{-z_i}) = \log(e^{z_i}) + \log(1 + e^{-z_i}) = \log(e^{z_i} + 1) = \log(1 + e^{z_i})$.
The final expression $\ell(\boldsymbol{\theta}) = \sum_i [y_i z_i - \log(1 + e^{z_i})]$ is elegant: a linear term $y_i z_i$ minus the log-partition function $\log(1 + e^{z_i})$. This is the canonical form for exponential family log-likelihoods, $\sum_i [y_i \eta_i - b(\eta_i)]$ (up to terms that do not involve $\boldsymbol{\theta}$), with natural parameter $\eta_i = z_i$ and log-partition (cumulant) function $b(\eta) = \log(1 + e^{\eta})$.
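As a quick check of the algebra, take a single observation with $y_i = 1$ and $z_i = 2$ (values chosen purely for illustration). The standard form and the simplified form give the same contribution:
$$y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i) = \log \sigma(2) = -\log(1 + e^{-2}) \approx -0.127$$
$$y_i z_i - \log(1 + e^{z_i}) = 2 - \log(1 + e^{2}) \approx 2 - 2.127 = -0.127$$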
Different formulations of the log-likelihood are useful in different contexts. Let's catalog the main variants.
Formulation 1: Standard Form
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
Intuition: Each observation contributes the log-probability of its actual outcome.
Formulation 2: Exponential Family Form
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i}) \right]$$
Intuition: Linear in the natural parameter minus the log-normalizer.
Formulation 3: Compact Vector Form
$$\ell(\boldsymbol{\theta}) = \mathbf{y}^T \mathbf{X} \boldsymbol{\theta} - \mathbf{1}^T \log(\mathbf{1} + \exp(\mathbf{X}\boldsymbol{\theta}))$$
Intuition: Matrix operations for efficient computation.
Formulation 4: Cross-Entropy Loss Form (Negative Log-Likelihood)
$$\mathcal{J}(\boldsymbol{\theta}) = -\frac{1}{n} \ell(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \left[ -y_i \log \pi_i - (1-y_i) \log(1-\pi_i) \right]$$
Intuition: The "loss" we minimize; average cross-entropy per observation.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def log_likelihood_standard(theta, X, y):
    """
    Standard form: Σ [y_i log(π_i) + (1-y_i) log(1-π_i)]

    This form is intuitive but requires careful handling to avoid log(0).
    """
    z = X @ theta
    pi = sigmoid(z)
    # Clip probabilities to avoid log(0)
    eps = 1e-15
    pi = np.clip(pi, eps, 1 - eps)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def log_likelihood_exponential(theta, X, y):
    """
    Exponential family form: Σ [y_i * z_i - log(1 + exp(z_i))]

    This form is numerically more stable but still needs care with large z values.
    """
    z = X @ theta
    # Use a cutoff for numerical stability when z is large:
    # log(1 + exp(z)) ≈ z when z >> 0
    log_partition = np.where(
        z > 20,
        z,  # For large z: log(1 + exp(z)) ≈ z
        np.log(1 + np.exp(z))
    )
    return np.sum(y * z - log_partition)

def log_likelihood_stable(theta, X, y):
    """
    Most numerically stable form using the log-sum-exp trick.

    This is the preferred implementation for production code.
    """
    z = X @ theta
    # Stable computation of log(1 + exp(z)) = max(0, z) + log(1 + exp(-|z|))
    log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
    return np.sum(y * z - log_partition)

# Verify all formulations give the same result
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
theta_true = np.array([0.5, 1.0, -0.5])
z = X @ theta_true
y = (np.random.rand(n) < sigmoid(z)).astype(float)

# Test with some arbitrary theta
theta_test = np.array([0.3, 0.8, -0.3])

ll_standard = log_likelihood_standard(theta_test, X, y)
ll_exp = log_likelihood_exponential(theta_test, X, y)
ll_stable = log_likelihood_stable(theta_test, X, y)

print("Log-likelihood values (should be equal):")
print(f"  Standard form:    {ll_standard:.6f}")
print(f"  Exponential form: {ll_exp:.6f}")
print(f"  Stable form:      {ll_stable:.6f}")
```

In production code, always use numerically stable formulations. The naive computation of $\log(1 + e^z)$ overflows for large $z$, and $\log(\pi_i)$ blows up to $-\infty$ when $\pi_i$ is rounded to 0. The stable implementations above handle these edge cases correctly.
The most important structural property of the logistic regression log-likelihood is concavity. A function is concave if its Hessian (matrix of second derivatives) is negative semi-definite everywhere.
Why Concavity Matters:
- Every local maximum of a concave function is a global maximum, so any stationary point of the log-likelihood is the MLE.
- Gradient ascent and Newton-type methods converge reliably, regardless of where they start.
- There are no spurious local optima or saddle points that could trap the optimizer.
Proving Concavity:
We prove concavity by showing the Hessian is negative semi-definite. Consider the log-likelihood of a single observation (the sum inherits concavity from its summands):
$$\ell_i(\boldsymbol{\theta}) = y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i})$$
Step 1: First Derivative (Gradient)
Let $z_i = \boldsymbol{\theta}^T \mathbf{x}_i$ and $\pi_i = \sigma(z_i)$.
$$\frac{\partial \ell_i}{\partial \boldsymbol{\theta}} = y_i \mathbf{x}_i - \frac{e^{z_i}}{1 + e^{z_i}} \mathbf{x}_i = y_i \mathbf{x}_i - \pi_i \mathbf{x}_i = (y_i - \pi_i) \mathbf{x}_i$$
Step 2: Second Derivative (Hessian)
$$\frac{\partial^2 \ell_i}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\frac{\partial \pi_i}{\partial \boldsymbol{\theta}} \mathbf{x}_i^T$$
Using $\frac{\partial \pi_i}{\partial z_i} = \pi_i(1 - \pi_i)$ and $\frac{\partial z_i}{\partial \boldsymbol{\theta}} = \mathbf{x}_i$:
$$\frac{\partial^2 \ell_i}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\pi_i(1 - \pi_i) \mathbf{x}_i \mathbf{x}_i^T$$
Step 3: Sum Over Observations
The total Hessian is:
$$\mathbf{H}(\boldsymbol{\theta}) = \frac{\partial^2 \ell}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T} = -\sum_{i=1}^{n} \pi_i(1-\pi_i) \mathbf{x}_i \mathbf{x}_i^T$$
In matrix notation:
$$\mathbf{H}(\boldsymbol{\theta}) = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$
where $\mathbf{W} = \text{diag}(\pi_1(1-\pi_1), \ldots, \pi_n(1-\pi_n))$ is a diagonal weight matrix.
Step 4: Negative Semi-Definiteness
For any vector $\mathbf{v} \neq \mathbf{0}$:
$$\mathbf{v}^T \mathbf{H} \mathbf{v} = -\mathbf{v}^T \mathbf{X}^T \mathbf{W} \mathbf{X} \mathbf{v} = -(\mathbf{X}\mathbf{v})^T \mathbf{W} (\mathbf{X}\mathbf{v})$$
Since $\pi_i(1-\pi_i) > 0$ for all $\pi_i \in (0, 1)$, the matrix $\mathbf{W}$ is positive definite. Therefore:
$$(\mathbf{X}\mathbf{v})^T \mathbf{W} (\mathbf{X}\mathbf{v}) \geq 0$$
with equality only if $\mathbf{X}\mathbf{v} = \mathbf{0}$.
Conclusion: $\mathbf{v}^T \mathbf{H} \mathbf{v} \leq 0$ for all $\mathbf{v}$, so $\mathbf{H}$ is negative semi-definite. The log-likelihood is concave. ∎
The log-likelihood is strictly concave (guaranteeing a unique maximum) if and only if $\mathbf{X}$ has full column rank and no $\pi_i \in \{0, 1\}$. If observations are perfectly separable (all $\pi_i \to 0$ or $1$), the MLE doesn't exist—parameters diverge to $\pm\infty$. We'll address this when discussing regularization.
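The sketch below illustrates this failure mode on a hypothetical perfectly separable dataset (the data, step size, and iteration count are arbitrary choices): gradient ascent keeps inflating $\|\boldsymbol{\theta}\|$ while the log-likelihood creeps toward 0 from below, so no finite maximizer exists.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Perfectly separable toy data: y = 1 exactly when x > 0
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = (x > 0).astype(float)

theta = np.zeros(2)
lr = 0.5

for step in range(1, 5001):
    pi = sigmoid(X @ theta)
    theta += lr * X.T @ (y - pi)   # gradient ascent on the log-likelihood
    if step % 1000 == 0:
        z = X @ theta
        ll = np.sum(y * z - (np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))))
        print(f"step {step:5d}: ||theta|| = {np.linalg.norm(theta):8.2f}, "
              f"log-likelihood = {ll:.6f}")

# ||theta|| keeps growing while the log-likelihood approaches 0 from below:
# for separable data the MLE does not exist.
```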
The concavity of the log-likelihood has beautiful geometric implications that aid intuition.
Level Sets and Contours:
The superlevel sets of the log-likelihood (the regions where $\ell(\boldsymbol{\theta})$ exceeds a given value) are convex, as they are for any concave function. As we move outward from the maximum, the log-likelihood decreases smoothly in every direction; there are no saddle points or ridges that could trap an optimization algorithm.
The Parameter Space Landscape:
Imagine standing at a point in parameter space $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)$:
- The surface around you rises toward a single peak, the MLE $\hat{\boldsymbol{\theta}}$.
- The gradient (the score) always points uphill toward that peak, so following it can never lead you astray.
- Along any fixed direction, the log-likelihood rises and then falls at most once; it never undulates.
Comparison with Non-Convex Landscapes:
Neural networks have non-convex loss surfaces with:
- many local minima of varying quality,
- saddle points and flat plateaus where gradients nearly vanish,
- sensitivity to initialization and optimizer settings.
Logistic regression's concave log-likelihood has none of these complications—it's a well-behaved optimization surface.
```python
import numpy as np

def log_likelihood_surface(theta0_range, theta1_range, X, y):
    """
    Compute log-likelihood over a grid of parameter values.
    Demonstrates the concave (bowl-shaped) structure.
    """
    T0, T1 = np.meshgrid(theta0_range, theta1_range)
    log_lik = np.zeros_like(T0)

    for i in range(len(theta0_range)):
        for j in range(len(theta1_range)):
            theta = np.array([theta0_range[i], theta1_range[j]])
            z = X @ theta
            # Stable log-likelihood computation
            log_partition = np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))
            log_lik[j, i] = np.sum(y * z - log_partition)

    return T0, T1, log_lik

def verify_hessian_negative_definite(theta, X, y):
    """
    Verify the Hessian is negative semi-definite at theta.
    """
    z = X @ theta
    pi = 1 / (1 + np.exp(-z))

    # Weight matrix W = diag(π_i(1-π_i))
    w = pi * (1 - pi)

    # Hessian H = -X^T W X
    W = np.diag(w)
    H = -X.T @ W @ X

    # Check eigenvalues are all ≤ 0
    eigenvalues = np.linalg.eigvalsh(H)

    print("Hessian Eigenvalue Analysis:")
    print(f"  Eigenvalues: {eigenvalues}")
    print(f"  All ≤ 0? {np.all(eigenvalues <= 1e-10)}")
    print("  → Hessian is negative semi-definite: the log-likelihood is concave!")

    return eigenvalues

# Generate example data
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n)])
theta_true = np.array([-1.0, 2.0])
z = X @ theta_true
y = (np.random.rand(n) < 1 / (1 + np.exp(-z))).astype(float)

# Verify at several random parameter values
for _ in range(3):
    theta_test = np.random.randn(2)
    print(f"\nAt θ = [{theta_test[0]:.2f}, {theta_test[1]:.2f}]:")
    verify_hessian_negative_definite(theta_test, X, y)
```

The negative expected Hessian $\mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}[\mathbf{H}(\boldsymbol{\theta})] = \mathbf{X}^T \mathbf{W} \mathbf{X}$ is the Fisher Information Matrix. It measures how much information the data carries about the parameters. Larger Fisher information means more precise parameter estimates (smaller confidence regions).
The gradient of the log-likelihood has a special name in statistics: the score function. Understanding it is key to both optimization and statistical inference.
The Score Function:
$$\mathbf{s}(\boldsymbol{\theta}) = \nabla_\theta \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} (y_i - \pi_i) \mathbf{x}_i = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\pi})$$
where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)^T$ is the vector of predicted probabilities.
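Before interpreting the score, it is worth confirming the formula numerically. The sketch below (on arbitrary simulated data; the data-generating parameters are illustrative) compares the analytic score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ against central finite differences of the log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(theta, X, y):
    z = X @ theta
    return np.sum(y * z - (np.maximum(0, z) + np.log(1 + np.exp(-np.abs(z)))))

def score(theta, X, y):
    """Analytic score: s(theta) = X^T (y - pi)."""
    return X.T @ (y - sigmoid(X @ theta))

# Toy data (arbitrary, for illustration only)
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.5]))).astype(float)

theta = rng.normal(size=p)
s_analytic = score(theta, X, y)

# Central finite differences, one coordinate at a time
eps = 1e-6
s_numeric = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    s_numeric[j] = (log_likelihood(theta + e, X, y) -
                    log_likelihood(theta - e, X, y)) / (2 * eps)

print("analytic:", np.round(s_analytic, 6))
print("numeric :", np.round(s_numeric, 6))
```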
Interpretation:
Each observation contributes $(y_i - \pi_i) \mathbf{x}_i$ to the score:
- If $y_i = 1$ but $\pi_i$ is small, the residual $y_i - \pi_i$ is large and positive, pushing $\boldsymbol{\theta}$ to raise the predicted probability along the direction $\mathbf{x}_i$.
- If $y_i = 0$ but $\pi_i$ is large, the push is equally strong in the opposite direction.
- Well-predicted observations ($\pi_i \approx y_i$) contribute almost nothing.
The Score Equations:
At the maximum likelihood estimate, the score is zero:
$$\mathbf{s}(\hat{\boldsymbol{\theta}}) = \mathbf{X}^T (\mathbf{y} - \hat{\boldsymbol{\pi}}) = \mathbf{0}$$
This is a system of $p+1$ equations (one per parameter) in $p+1$ unknowns.
Moment Interpretation:
The score equations state that, at the MLE, the feature-weighted residuals sum to zero:
$$\sum_{i=1}^{n} (y_i - \hat{\pi}_i) x_{ij} = 0 \quad \text{for each } j = 0, 1, \ldots, p$$
For the intercept ($j = 0$, $x_{i0} = 1$): $$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{\pi}_i$$
The sum of observed 1's equals the sum of predicted probabilities—the model is "calibrated" in aggregate.
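This aggregate calibration is easy to verify numerically. The sketch below fits the MLE with a few Newton steps (using the Hessian derived earlier; the simulated data and iteration count are illustrative assumptions) and checks that the score equations hold at $\hat{\boldsymbol{\theta}}$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data (arbitrary choices, for illustration only)
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (rng.random(n) < sigmoid(X @ np.array([-0.5, 1.2, -0.8]))).astype(float)

# Fit the MLE with Newton steps: theta <- theta + (X^T W X)^{-1} X^T (y - pi)
theta = np.zeros(X.shape[1])
for _ in range(25):
    pi = sigmoid(X @ theta)
    W = np.diag(pi * (1 - pi))
    theta += np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))

pi_hat = sigmoid(X @ theta)

# Score equations at the MLE: X^T (y - pi_hat) ≈ 0
print("X^T (y - pi_hat):", np.round(X.T @ (y - pi_hat), 8))

# Intercept equation: sum of observed 1's equals sum of fitted probabilities
print("sum(y)      =", y.sum())
print("sum(pi_hat) =", round(pi_hat.sum(), 6))
```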
It's illuminating to compare the log-likelihood structure of logistic regression with linear regression. Both fit within the generalized linear model framework.
Linear Regression (Gaussian Response):
Model: $y_i = \boldsymbol{\theta}^T \mathbf{x}_i + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
Log-likelihood: $$\ell(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \boldsymbol{\theta}^T \mathbf{x}_i)^2$$
Score: $\mathbf{s}(\boldsymbol{\theta}) = \frac{1}{\sigma^2} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$
Setting to zero: $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ → $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
Logistic Regression (Bernoulli Response):
Model: $y_i \sim \text{Bernoulli}(\sigma(\boldsymbol{\theta}^T \mathbf{x}_i))$
Log-likelihood: $$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i (\boldsymbol{\theta}^T \mathbf{x}_i) - \log(1 + e^{\boldsymbol{\theta}^T \mathbf{x}_i}) \right]$$
Score: $\mathbf{s}(\boldsymbol{\theta}) = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$
Setting to zero: $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ — nonlinear in $\boldsymbol{\theta}$!
| Property | Linear Regression | Logistic Regression |
|---|---|---|
| Response Distribution | Gaussian (continuous) | Bernoulli (binary) |
| Link Function | Identity: $\mu = \boldsymbol{\theta}^T\mathbf{x}$ | Logit: $\log\frac{\pi}{1-\pi} = \boldsymbol{\theta}^T\mathbf{x}$ |
| Log-Likelihood Form | Quadratic in $\boldsymbol{\theta}$ | Nonlinear in $\boldsymbol{\theta}$ |
| Score Equation | Linear: $\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$ | Nonlinear: $\mathbf{X}^T\boldsymbol{\pi}(\boldsymbol{\theta}) = \mathbf{X}^T\mathbf{y}$ |
| Closed-Form Solution | Yes: $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | No |
| Hessian | Constant: $-\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$ | Varies: $-\mathbf{X}^T\mathbf{W}(\boldsymbol{\theta})\mathbf{X}$ |
| Optimization | One-shot matrix inversion | Iterative (gradient methods) |
The fundamental difference is that in linear regression, the mapping from parameters to predictions is linear ($\hat{y} = \mathbf{X}\boldsymbol{\theta}$), while in logistic regression, it's nonlinear ($\hat{\pi} = \sigma(\mathbf{X}\boldsymbol{\theta})$). This nonlinearity is why we need iterative optimization methods—the score equations cannot be solved analytically.
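To make the contrast concrete, here is a minimal sketch on simulated data (the dataset, learning rate, and iteration count are arbitrary illustrative choices): ordinary least squares is solved in one shot from the normal equations, while the logistic MLE is obtained by iterating on the score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Linear regression: closed-form solution from the normal equations
beta_true = np.array([1.0, 2.0])
y_lin = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_lin)   # one-shot solve
print("OLS estimate:", np.round(beta_hat, 3))

# Logistic regression: no closed form; iterate gradient ascent on the log-likelihood
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta_true = np.array([-1.0, 2.0])
y_bin = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

theta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    theta += lr * X.T @ (y_bin - sigmoid(X @ theta))   # score = X^T (y - pi)
print("Logistic estimate (iterative):", np.round(theta, 3))
```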
We have developed the log-likelihood formulation that is central to training logistic regression models. Let's consolidate the key insights:
- Taking logarithms turns the product of probabilities into a sum, preventing numerical underflow while preserving the location of the maximum.
- The log-likelihood simplifies to $\ell(\boldsymbol{\theta}) = \sum_i [y_i z_i - \log(1 + e^{z_i})]$, which is the negative of the binary cross-entropy loss.
- Its Hessian, $-\mathbf{X}^T\mathbf{W}\mathbf{X}$, is negative semi-definite, so the log-likelihood is concave and every stationary point is a global maximum.
- The score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$ vanishes at the MLE, which means the fitted probabilities are calibrated in aggregate.
- Unlike linear regression, the score equations are nonlinear in $\boldsymbol{\theta}$, so estimation requires iterative optimization.
What's Next:
We've established the log-likelihood and proven its concavity, but we haven't yet derived the gradient explicitly in a form suitable for optimization algorithms. The next page develops the gradient derivation in detail—computing the partial derivatives, understanding the backpropagation structure, and setting up the foundation for gradient descent and Newton's method.
You now understand why the log-likelihood is essential for optimization, can derive it from the likelihood, appreciate its concavity (negative semi-definite Hessian), and connect it to cross-entropy loss. Next, we'll derive the gradient explicitly for implementation in optimization algorithms.