Logistic regression, despite its conceptual elegance and interpretability, faces a fundamental challenge that becomes critical as model complexity increases: overfitting. When we fit a logistic regression model to data—particularly in high-dimensional settings where the number of features approaches or exceeds the number of observations—the maximum likelihood estimator can exhibit pathological behavior that renders the model useless for prediction.
L2 regularization (also known as Ridge regularization or Tikhonov regularization in the broader numerical analysis context) provides a principled solution to this problem by introducing a penalty term that constrains the magnitude of the model coefficients. This page develops the complete theory of L2-regularized logistic regression, from mathematical foundations to practical implementation.
This page assumes familiarity with standard logistic regression, maximum likelihood estimation, and basic convex optimization. We build upon the logistic regression model and MLE concepts from earlier modules, extending them with regularization machinery.
Before diving into the solution, we must thoroughly understand the problem. Overfitting in logistic regression manifests differently than in linear regression, and understanding these nuances is essential for appreciating why regularization is necessary.
In standard logistic regression, we maximize the log-likelihood:
$$\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right]$$
where $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ and $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function.
The Perfect Separation Problem: When data is linearly separable, the MLE does not exist in the finite parameter space. The optimization algorithm will drive coefficients toward infinity, attempting to create a decision boundary that perfectly separates the classes.
Complete separation occurs when a hyperplane can perfectly divide the two classes with no overlap. Quasi-complete separation occurs when there exists at least one observation on the decision boundary. In both cases, the MLE is undefined or lies at infinity, causing numerical instability and infinite standard errors for affected coefficients.
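To make this concrete, here is a small numerical sketch (plain NumPy, on a made-up one-dimensional separable dataset): the unpenalized log-likelihood keeps improving as the coefficient grows without bound, while an L2-penalized objective peaks at a finite value.

```python
import numpy as np
from scipy.special import expit

# Toy 1-D dataset that is perfectly separable: x < 0 -> class 0, x > 0 -> class 1
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(beta):
    p = np.clip(expit(beta * x), 1e-15, 1 - 1e-15)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def penalized_log_likelihood(beta, lam=1.0):
    return log_likelihood(beta) - 0.5 * lam * beta ** 2

for beta in [1.0, 2.0, 5.0, 10.0, 50.0]:
    print(f"beta={beta:5.1f}  log-lik={log_likelihood(beta):10.4f}  "
          f"penalized={penalized_log_likelihood(beta):10.4f}")
# The log-likelihood creeps toward 0 (its supremum) only as beta -> infinity,
# so no finite MLE exists; the penalized objective attains its maximum at a finite beta.
```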
When the number of features $p$ is large relative to the sample size $n$, several problems emerge:
Variance Explosion: Even when the MLE exists, the estimator variance becomes extremely large. Small changes in training data cause dramatic changes in the estimated coefficients, making the model unreliable.
Multicollinearity: Correlated features create near-singular design matrices, amplifying coefficient instability. A small amount of noise can completely change which correlated feature gets the "credit" for predictive power.
Spurious Correlations: With many features, some will be spuriously correlated with the outcome by chance alone. The MLE will exploit these spurious correlations, fitting noise rather than signal.
| Scenario | Symptom | Consequence |
|---|---|---|
| Perfect separation | Coefficients → ±∞ | Non-convergence, undefined standard errors |
| High p/n ratio | Extreme coefficient values | Poor generalization, high variance |
| Multicollinearity | Unstable coefficient signs | Uninterpretable model, sensitivity to data |
| Small sample size | Overconfident predictions | Predicted probabilities near 0 or 1 |
For classification, we care about the expected prediction error on new data. This error can be decomposed:
$$\text{EPE} = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
The unregularized MLE is asymptotically unbiased under correct model specification, but its variance can become effectively unbounded in the problematic settings described above. Regularization introduces a controlled amount of bias in exchange for a substantial variance reduction, often improving overall prediction accuracy.
This is not merely a theoretical concern—it's the fundamental justification for regularization. We deliberately move away from the "best" unbiased estimator because bias can be a worthwhile price for stability.
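As an illustration of this tradeoff, the sketch below (scikit-learn on a synthetic data-generating process chosen only for this example) refits a nearly unregularized and a strongly regularized model on many independent training sets and compares the spread and bias of the coefficient estimates.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p, n_train, n_reps = 10, 100, 100
true_beta = np.zeros(p)
true_beta[:3] = [1.0, -1.0, 0.5]   # only three informative features

def coef_over_replications(C):
    """Fit LogisticRegression(C=C) on n_reps fresh training sets; return all coefficients."""
    coefs = []
    for _ in range(n_reps):
        X = rng.standard_normal((n_train, p))
        y = (rng.random(n_train) < expit(X @ true_beta)).astype(int)
        coefs.append(LogisticRegression(C=C, max_iter=5000).fit(X, y).coef_[0])
    return np.array(coefs)

for C, label in [(1e4, "nearly unregularized"), (0.1, "strongly regularized")]:
    coefs = coef_over_replications(C)
    bias2 = np.mean((coefs.mean(axis=0) - true_beta) ** 2)
    variance = np.mean(coefs.var(axis=0))
    print(f"{label:22s}  avg squared bias = {bias2:.4f}   avg variance = {variance:.4f}")
# Regularization adds bias but cuts the variance, often lowering total error.
```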
L2 regularization modifies the maximum likelihood objective by adding a penalty term proportional to the squared L2 norm of the coefficient vector. The regularized objective becomes:
$$\mathcal{L}_{\lambda}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] - \frac{\lambda}{2} \|\boldsymbol{\beta}\|_2^2$$
or equivalently, we minimize the penalized negative log-likelihood:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \frac{\lambda}{2} \sum_{j=1}^{p} \beta_j^2$$
where $\lambda \geq 0$ is the regularization strength and $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ as before. Setting $\lambda = 0$ recovers the unregularized MLE; larger $\lambda$ imposes a stronger penalty on large coefficients.
By convention, the intercept term $\beta_0$ is typically not regularized. Penalizing the intercept would bias the model toward predicting equal class probabilities, which is inappropriate when classes are imbalanced. The penalty applies only to the coefficients $\beta_1, \ldots, \beta_p$ associated with features.
The penalized formulation is equivalent to a constrained optimization problem. For every value of $\lambda$, there exists a corresponding constraint $t$ such that:
$$\begin{aligned} \text{maximize} \quad & \mathcal{L}(\boldsymbol{\beta}) \\ \text{subject to} \quad & \|\boldsymbol{\beta}\|_2^2 \leq t \end{aligned}$$
This constrained formulation is often more intuitive: we're maximizing the likelihood subject to a "budget" constraint on how large coefficients can be. The relationship between $\lambda$ and $t$ is monotonic—larger $\lambda$ corresponds to smaller $t$ (tighter constraint).
Geometric Interpretation: The constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a ball centered at the origin (a disk in two dimensions, a hypersphere boundary in general). We seek the point in this ball that maximizes the log-likelihood; for the constraint to treat all coefficients symmetrically, the features should be on a common scale.
```python
import numpy as np
from scipy.special import expit  # Numerically stable sigmoid


def l2_regularized_logistic_loss(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the L2-regularized logistic regression loss.

    Parameters
    ----------
    beta : array-like, shape (p,) or (p+1,)
        Coefficient vector (includes intercept if fit_intercept=True)
    X : array-like, shape (n, p)
        Feature matrix
    y : array-like, shape (n,)
        Binary labels (0 or 1)
    lambda_reg : float
        Regularization strength (lambda >= 0)
    fit_intercept : bool
        Whether first element of beta is the intercept

    Returns
    -------
    loss : float
        The penalized negative log-likelihood
    """
    n = X.shape[0]

    # Separate intercept from feature coefficients
    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Compute probabilities using numerically stable sigmoid
    prob = expit(linear_pred)

    # Clip probabilities to avoid log(0)
    eps = 1e-15
    prob = np.clip(prob, eps, 1 - eps)

    # Negative log-likelihood (cross-entropy loss)
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

    # L2 penalty on feature coefficients (not intercept)
    l2_penalty = 0.5 * lambda_reg * np.sum(coef ** 2)

    return nll + l2_penalty


def l2_regularized_logistic_gradient(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the gradient of the L2-regularized logistic regression loss.

    The gradient combines the log-likelihood gradient with the
    L2 penalty gradient.
    """
    n = X.shape[0]

    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Prediction residuals: (p_i - y_i)
    prob = expit(linear_pred)
    residuals = prob - y

    if fit_intercept:
        # Gradient w.r.t. intercept (no regularization)
        grad_intercept = np.sum(residuals)
        # Gradient w.r.t. coefficients (includes L2 penalty)
        grad_coef = X.T @ residuals + lambda_reg * coef
        return np.concatenate([[grad_intercept], grad_coef])
    else:
        return X.T @ residuals + lambda_reg * beta
```

Different software packages use different parameterizations of the regularization strength:
| Framework | Parameterization | Relationship to $\lambda$ |
|---|---|---|
| scikit-learn | $C = 1/\lambda$ | Inverse regularization strength |
| statsmodels | $\alpha = \lambda$ | Direct regularization strength |
| R (glmnet) | $\lambda$ with scaling | $\lambda_{\text{glmnet}} \cdot n = \lambda$ |
scikit-learn convention: Uses $C$ where larger values mean less regularization (confusing but historical). The objective becomes:
$$J(\boldsymbol{\beta}) = \frac{1}{2C}\|\boldsymbol{\beta}\|_2^2 + \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)$$
(equivalently, scikit-learn multiplies the summed loss by $C$ and leaves the penalty as $\frac{1}{2}\|\boldsymbol{\beta}\|_2^2$).
Always verify the parameterization when using new software to ensure correct hyperparameter specification.
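As a sketch of the conversion (assuming the $\lambda$ used on this page, i.e. a $(\lambda/2)\|\boldsymbol{\beta}\|_2^2$ penalty added to the summed negative log-likelihood), scikit-learn's `C` is simply the reciprocal; note that exact equivalence also depends on whether a package sums or averages the per-sample losses.

```python
def lambda_to_sklearn_C(lam: float) -> float:
    """Map the lambda of this page's penalized objective to scikit-learn's C = 1/lambda."""
    if lam <= 0:
        raise ValueError("lambda must be positive; lambda = 0 (no regularization) "
                         "corresponds to a very large C instead.")
    return 1.0 / lam

# Example: lambda = 10 corresponds to LogisticRegression(C=0.1)
print(lambda_to_sklearn_C(10.0))   # 0.1
```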
L2 regularization implements shrinkage: it systematically pulls coefficient estimates toward zero. This isn't merely a computational trick but has profound statistical consequences.
Consider the gradient of the L2 penalty:
$$\frac{\partial}{\partial \beta_j} \left( \frac{\lambda}{2} \sum_k \beta_k^2 \right) = \lambda \beta_j$$
This gradient is proportional to the coefficient value itself. Large coefficients incur larger gradients pushing them toward zero, while small coefficients experience less penalty. The result is proportional shrinkage:
$$\hat{\beta}_j^{\text{ridge}} \approx \frac{1}{1 + \lambda/\text{curvature}} \cdot \hat{\beta}_j^{\text{MLE}}$$
where "curvature" refers to the second derivative of the log-likelihood. Coefficients are shrunk toward zero by a factor that depends on both $\lambda$ and how well-determined they are by the data.
Unlike L1 regularization (Lasso), L2 regularization never shrinks coefficients exactly to zero. The penalty gradient vanishes as $\beta_j \to 0$, so there's always a residual contribution. This means L2 regularization cannot perform feature selection—all features remain in the model with (typically) small but non-zero coefficients.
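A quick empirical check of this contrast (scikit-learn on synthetic data): as the penalty grows, the L2 coefficient norm shrinks steadily but no coefficient hits exactly zero, whereas an L1 penalty of comparable strength zeroes many of them outright.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)

for C in [10.0, 1.0, 0.1, 0.01]:          # remember: lambda is roughly 1/C
    l2 = LogisticRegression(penalty="l2", C=C, max_iter=2000).fit(X, y)
    l1 = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                            max_iter=2000).fit(X, y)
    print(f"C={C:6.2f}  L2: ||beta||={np.linalg.norm(l2.coef_):6.3f}, "
          f"exact zeros={np.sum(l2.coef_ == 0)}   "
          f"L1: exact zeros={np.sum(l1.coef_ == 0)}")
```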
The geometry of L2 regularization illuminates why it works:
Likelihood Contours: In the two-coefficient case, the log-likelihood defines contour lines (curves of constant likelihood) in the $(\beta_1, \beta_2)$ plane. The MLE sits at the center of these concentric ellipses.
Constraint Region: The L2 constraint $\beta_1^2 + \beta_2^2 \leq t$ defines a circular disk centered at the origin.
Solution Location: The regularized solution occurs where the highest-valued likelihood contour touches the constraint disk. Due to the circular shape, this tangent point typically has both coefficients non-zero but smaller in magnitude than the MLE.
L2 regularization handles correlated features gracefully through coefficient sharing:
When two features $x_1$ and $x_2$ are highly correlated and both predictive of $y$, L2 regularization distributes the weight roughly equally between them instead of arbitrarily loading it onto one, as the unregularized MLE tends to do.
This behavior arises because the L2 penalty is minimized when equal total weight is spread across correlated features rather than concentrated in one. If $\beta_1 + \beta_2 = c$ is required for good fit, then $\beta_1 = \beta_2 = c/2$ minimizes $\beta_1^2 + \beta_2^2$.
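The sketch below (a synthetic example with two nearly duplicate features, constructed for this illustration) shows the effect directly: the fitted L2 model gives the correlated pair nearly equal coefficients and leaves the unrelated feature near zero.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
z = rng.standard_normal(n)
x1 = z + 0.01 * rng.standard_normal(n)     # x1 and x2 are almost perfectly correlated
x2 = z + 0.01 * rng.standard_normal(n)
x3 = rng.standard_normal(n)                # an unrelated feature
X = np.column_stack([x1, x2, x3])
y = (rng.random(n) < expit(2.0 * z)).astype(int)   # the signal comes through z

model = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)
print("correlation(x1, x2) =", np.corrcoef(x1, x2)[0, 1].round(4))
print("coefficients:", model.coef_[0].round(3))
# Expect beta_1 and beta_2 to be roughly equal (each carrying about half the
# total weight for z) and beta_3 to be near zero.
```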
| Property | Unregularized MLE | L2 Regularized |
|---|---|---|
| Coefficient magnitude | Can be arbitrarily large | Bounded by regularization strength |
| Correlated features | Unstable, arbitrary assignment | Weights distributed evenly |
| Feature selection | Implicit via p-values | All features retained |
| Perfect separation | Coefficients diverge | Finite, stable solution |
| Variance | Can be very high | Substantially reduced |
| Bias | Zero (asymptotically) | Small, controlled bias |
The L2-regularized logistic regression objective is strictly convex when $\lambda > 0$. This has profound computational implications:
Original Log-Likelihood: The negative log-likelihood of logistic regression is convex (equivalently, the log-likelihood is concave), but not strictly convex: the Hessian can be singular when features are collinear.
Adding L2 Penalty: The L2 penalty $\frac{\lambda}{2}|\boldsymbol{\beta}|_2^2$ is strictly convex with Hessian $\lambda \mathbf{I}$. Adding it to any convex function yields a strictly convex function.
Consequence: The regularized objective has a unique global minimum. Any local minimum is the global minimum, and gradient-based methods are guaranteed to converge to it.
For efficient optimization, we need the gradient and Hessian of the regularized objective.
Gradient (for minimization):
$$\nabla J(\boldsymbol{\beta}) = \sum_{i=1}^{n} (p_i - y_i) \mathbf{x}_i + \lambda \boldsymbol{\beta}$$
Or in matrix form:
$$\nabla J(\boldsymbol{\beta}) = \mathbf{X}^\top (\mathbf{p} - \mathbf{y}) + \lambda \boldsymbol{\beta}$$
where $\mathbf{p} = [p_1, \ldots, p_n]^\top$ is the vector of predicted probabilities.
Hessian:
$$\mathbf{H} = \nabla^2 J(\boldsymbol{\beta}) = \mathbf{X}^\top \mathbf{W} \mathbf{X} + \lambda \mathbf{I}$$
where $\mathbf{W} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n))$ is a diagonal matrix of variance weights.
The regularization term $\lambda \mathbf{I}$ ensures the Hessian is always positive definite, guaranteeing well-conditioned optimization.
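A minimal numerical check of this claim (NumPy, with a deliberately collinear design matrix made up for the example): the unpenalized Hessian $\mathbf{X}^\top \mathbf{W} \mathbf{X}$ has a zero eigenvalue, while adding $\lambda \mathbf{I}$ lifts every eigenvalue to at least $\lambda$.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2.0 * x1, rng.standard_normal(n)])  # column 2 = 2 * column 1

beta = np.array([0.5, -0.2, 1.0])
p = expit(X @ beta)
W = np.diag(p * (1 - p))

H = X.T @ W @ X                  # unpenalized Hessian: singular due to collinearity
lam = 0.5
H_reg = H + lam * np.eye(3)      # regularized Hessian

print("min eigenvalue of X'WX        :", np.linalg.eigvalsh(H).min())
print("min eigenvalue of X'WX + lam*I:", np.linalg.eigvalsh(H_reg).min())
# The first is (numerically) zero; the second is bounded below by lambda = 0.5.
```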
```python
import numpy as np
from scipy.special import expit


class L2RegularizedLogisticRegression:
    """
    L2-Regularized Logistic Regression using Newton-Raphson (IRLS).

    This implementation demonstrates the IRLS algorithm with L2
    regularization, which is equivalent to Newton's method for
    logistic regression.
    """

    def __init__(self, lambda_reg=1.0, max_iter=100, tol=1e-6):
        self.lambda_reg = lambda_reg
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = 0

    def fit(self, X, y):
        """
        Fit the L2-regularized logistic regression model using IRLS.

        The IRLS update is:
            beta_new = (X'WX + lambda*I)^{-1} * X'W * z
        where:
            - W = diag(p_i * (1 - p_i)) is the weight matrix
            - z = X*beta + W^{-1}(y - p) is the working response
        """
        n, p = X.shape

        # Add intercept column
        X_aug = np.column_stack([np.ones(n), X])

        # Initialize coefficients at zero
        beta = np.zeros(p + 1)

        # Regularization matrix (don't penalize intercept)
        reg_matrix = self.lambda_reg * np.eye(p + 1)
        reg_matrix[0, 0] = 0  # No regularization on intercept

        for iteration in range(self.max_iter):
            # Compute predicted probabilities
            linear_pred = X_aug @ beta
            p_pred = expit(linear_pred)

            # Compute weights (variance of binomial)
            weights = p_pred * (1 - p_pred)
            weights = np.maximum(weights, 1e-10)  # Numerical stability

            # Weight matrix
            W = np.diag(weights)

            # Working response (adjusted dependent variable)
            z = linear_pred + (y - p_pred) / weights

            # Compute the regularized Hessian
            H_reg = X_aug.T @ W @ X_aug + reg_matrix

            # Compute the right-hand side
            rhs = X_aug.T @ W @ z

            # Solve the normal equations
            beta_new = np.linalg.solve(H_reg, rhs)

            # Check for convergence
            diff = np.max(np.abs(beta_new - beta))
            beta = beta_new

            if diff < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        # Store results
        self.intercept_ = beta[0]
        self.coef_ = beta[1:]

        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_pred = X @ self.coef_ + self.intercept_
        prob_1 = expit(linear_pred)
        return np.column_stack([1 - prob_1, prob_1])

    def predict(self, X, threshold=0.5):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration of IRLS convergence
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sample data
    n, p = 200, 10
    X = np.random.randn(n, p)
    true_coef = np.array([1.5, -1.0, 0.5, 0, 0, 0, 0, 0, 0, 0])
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(int)

    # Fit models with different regularization strengths
    for lam in [0.01, 0.1, 1.0, 10.0]:
        model = L2RegularizedLogisticRegression(lambda_reg=lam)
        model.fit(X, y)
        print(f"lambda={lam:5.2f}: coef magnitude = {np.linalg.norm(model.coef_):.4f}, "
              f"iterations = {model.n_iter_}")
```

Several algorithms efficiently solve the L2-regularized problem:
Newton-Raphson / IRLS: The Iteratively Reweighted Least Squares algorithm is Newton's method applied to logistic regression. With L2 regularization, the update becomes:
$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \nabla J(\boldsymbol{\beta}^{(t)})$$
This converges quadratically near the optimum.
L-BFGS: A quasi-Newton method that approximates the Hessian using gradient history. Efficient for medium-scale problems where the Hessian is expensive to compute or store.
Stochastic Gradient Descent (SGD): For massive datasets, SGD updates parameters using gradient estimates from mini-batches. The L2 penalty gradient is trivially incorporated as $\lambda \boldsymbol{\beta}$.
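Below is a minimal mini-batch SGD sketch in plain NumPy (the learning rate, batch size, and epoch count are arbitrary illustrative choices): the L2 term simply adds $\lambda \boldsymbol{\beta}$ to each gradient estimate, which is the familiar weight-decay update, with the intercept left unpenalized.

```python
import numpy as np
from scipy.special import expit

def sgd_l2_logistic(X, y, lam=1.0, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD for L2-regularized logistic regression (intercept unpenalized).
    Gradients are scaled to the full-data objective: summed losses + (lam/2)*||beta||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
            resid = expit(X[idx] @ beta + intercept) - y[idx]      # p_i - y_i on the batch
            scale = n / len(idx)                                   # rescale batch sum to full sum
            grad_beta = scale * X[idx].T @ resid + lam * beta      # L2 penalty acts as weight decay
            grad_intercept = scale * resid.sum()                   # intercept: no decay
            beta -= lr / n * grad_beta
            intercept -= lr / n * grad_intercept
    return intercept, beta

# Quick smoke test on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = (rng.random(500) < expit(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]))).astype(int)
print(sgd_l2_logistic(X, y, lam=1.0))
```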
L2 regularization is scale-dependent. If feature A is measured in meters and feature B in millimeters, the coefficient for B will be 1000x smaller and experience 1,000,000x less penalty. Always standardize features (zero mean, unit variance) before applying L2 regularization. This ensures all coefficients are penalized on a fair, comparable scale.
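A small sketch of this warning in action (synthetic data; the unit conversion is the only change between fits): expressing the same feature in millimeters instead of meters shrinks its raw coefficient by a factor of roughly 1000, so the quadratic penalty barely touches it and the two fits are no longer equivalent.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
x_m = rng.normal(loc=0.0, scale=2.0, size=n)          # a length measured in meters
y = (rng.random(n) < expit(1.5 * x_m)).astype(int)

X_meters = x_m.reshape(-1, 1)
X_millimeters = (1000.0 * x_m).reshape(-1, 1)          # same feature, different units

for name, X, unit_scale in [("meters", X_meters, 1.0),
                            ("millimeters", X_millimeters, 1000.0)]:
    coef = LogisticRegression(penalty="l2", C=0.001, max_iter=2000).fit(X, y).coef_[0, 0]
    print(f"{name:12s} raw coef = {coef: .6f}   implied effect per meter = {coef * unit_scale: .4f}")
# With strong regularization (small C), the meter version is shrunk heavily while the
# millimeter version is barely shrunk at all, so the two fits disagree.
```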
L2 regularization admits a compelling Bayesian interpretation: it corresponds to Maximum A Posteriori (MAP) estimation with a Gaussian prior on the coefficients.
Setup: Place an independent Gaussian prior on the coefficients, $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$, and keep the usual Bernoulli likelihood $p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta})$ of the logistic regression model.
MAP Objective: We maximize the posterior probability:
$$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} \, p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) = \arg\max_{\boldsymbol{\beta}} \, p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
Taking the log:
$$\log p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \underbrace{\sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]}_{\text{Log-likelihood}} - \underbrace{\frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2}_{\text{Log-prior}}$$
Setting $\lambda = 1/\tau^2$ recovers the L2-penalized objective exactly.
The regularization parameter $\lambda = 1/\tau^2$ controls our prior beliefs. Large $\lambda$ (small $\tau^2$) means we have a strong prior belief that coefficients should be near zero—the prior is "tight." Small $\lambda$ (large $\tau^2$) means we have weak prior beliefs and let the data dominate—the prior is "diffuse."
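A brief numerical sketch of this equivalence (using `scipy.optimize.minimize` on synthetic data; the value of $\tau^2$ is an arbitrary choice): minimizing the negative log-posterior under a $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ prior and minimizing the L2-penalized negative log-likelihood with $\lambda = 1/\tau^2$ return the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
y = (rng.random(n) < expit(X @ np.array([1.0, -0.5, 0.0]))).astype(int)

def nll(beta):
    prob = np.clip(expit(X @ beta), 1e-15, 1 - 1e-15)
    return -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

tau2 = 0.5                                    # prior variance
lam = 1.0 / tau2                              # corresponding ridge penalty

neg_log_posterior = lambda b: nll(b) + 0.5 / tau2 * np.sum(b ** 2)   # Gaussian prior term
ridge_objective   = lambda b: nll(b) + 0.5 * lam * np.sum(b ** 2)    # L2 penalty term

beta_map   = minimize(neg_log_posterior, np.zeros(p)).x
beta_ridge = minimize(ridge_objective, np.zeros(p)).x
print("MAP estimate  :", beta_map.round(6))
print("Ridge estimate:", beta_ridge.round(6))   # identical up to optimizer tolerance
```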
The Gaussian prior has several important properties that explain L2 regularization behavior:
Continuous Density at Zero: Unlike the Laplace prior (L1), the Gaussian prior has a smooth, continuous density at zero with zero derivative. This explains why L2 regularization shrinks coefficients smoothly toward zero without ever reaching exactly zero.
Quadratic Penalization: The log of a Gaussian is a quadratic function. This ensures the penalty is differentiable everywhere, enabling efficient gradient-based optimization.
Independence Assumption: The isotropic prior $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ assumes coefficients are independent a priori. This is a simplifying assumption—in practice, domain knowledge might suggest correlated priors, leading to more sophisticated regularizers.
While we typically compute only the MAP estimate, the full Bayesian approach would compute the posterior distribution:
$$p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
This posterior is not analytically tractable for logistic regression (unlike linear regression with conjugate priors). However, we can approximate it using the Laplace approximation (a Gaussian centered at the MAP estimate, as implemented below), Markov chain Monte Carlo sampling, or variational inference.
These approaches provide uncertainty quantification—not just point estimates but credible intervals for coefficients and predictions.
```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm


def laplace_approximation(X, y, beta_map, lambda_reg):
    """
    Compute the Laplace approximation to the posterior.

    The posterior is approximated as:
        p(beta | y, X) ≈ N(beta_map, H^{-1})
    where H is the Hessian of the negative log-posterior at beta_map.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (without intercept column)
    y : array, shape (n,)
        Binary labels
    beta_map : array, shape (p+1,)
        MAP estimate (includes intercept)
    lambda_reg : float
        Regularization strength

    Returns
    -------
    beta_map : array
        Mean of approximate posterior
    cov_matrix : array, shape (p+1, p+1)
        Covariance of approximate posterior
    std_errors : array, shape (p+1,)
        Standard errors (sqrt of diagonal of covariance)
    """
    n, p = X.shape
    X_aug = np.column_stack([np.ones(n), X])

    # Compute predicted probabilities at MAP estimate
    linear_pred = X_aug @ beta_map
    p_pred = expit(linear_pred)

    # Weights for IRLS
    weights = p_pred * (1 - p_pred)
    W = np.diag(weights)

    # Hessian of negative log-likelihood
    H_nll = X_aug.T @ W @ X_aug

    # Hessian of negative log-prior (regularization)
    H_prior = lambda_reg * np.eye(p + 1)
    H_prior[0, 0] = 0  # No regularization on intercept

    # Total Hessian (negative log-posterior)
    H_total = H_nll + H_prior

    # Covariance is inverse Hessian
    cov_matrix = np.linalg.inv(H_total)
    std_errors = np.sqrt(np.diag(cov_matrix))

    return beta_map, cov_matrix, std_errors


def compute_credible_intervals(beta_map, std_errors, alpha=0.05):
    """
    Compute credible intervals for coefficients.

    Parameters
    ----------
    beta_map : array
        MAP estimates
    std_errors : array
        Standard errors from Laplace approximation
    alpha : float
        Significance level (default 0.05 for 95% intervals)

    Returns
    -------
    intervals : array, shape (n_coef, 2)
        Lower and upper bounds of credible intervals
    """
    z = norm.ppf(1 - alpha / 2)
    lower = beta_map - z * std_errors
    upper = beta_map + z * std_errors
    return np.column_stack([lower, upper])


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    n, p = 100, 5
    X = np.random.randn(n, p)
    true_beta = np.array([0.5, 1.0, -0.5, 0.0, 0.0])
    prob = expit(X @ true_beta)
    y = (np.random.rand(n) < prob).astype(int)

    # Fit L2-regularized model (using a generic optimizer for the demo)
    from scipy.optimize import minimize

    def neg_log_posterior(beta, X, y, lam):
        X_aug = np.column_stack([np.ones(len(y)), X])
        linear_pred = X_aug @ beta
        p = expit(linear_pred)
        p = np.clip(p, 1e-15, 1 - 1e-15)
        nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        penalty = 0.5 * lam * np.sum(beta[1:] ** 2)  # Don't penalize intercept
        return nll + penalty

    lambda_reg = 1.0
    beta_init = np.zeros(p + 1)
    result = minimize(neg_log_posterior, beta_init, args=(X, y, lambda_reg))
    beta_map = result.x

    # Laplace approximation
    _, cov, se = laplace_approximation(X, y, beta_map, lambda_reg)
    intervals = compute_credible_intervals(beta_map, se)

    print("\nBayesian Logistic Regression with L2 Prior")
    print("-" * 60)
    print(f"{'Param':<8} {'Estimate':>10} {'Std Err':>10} {'95% CI':>20}")
    print("-" * 60)
    for i, name in enumerate(["Intercept"] + [f"x{j+1}" for j in range(p)]):
        print(f"{name:<8} {beta_map[i]:>10.4f} {se[i]:>10.4f} "
              f"[{intervals[i, 0]:>7.4f}, {intervals[i, 1]:>7.4f}]")
```

Feature Standardization: As emphasized earlier, L2 regularization requires standardized features. The standard approach:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of feature $j$.
Binary and Categorical Features: For binary features (0/1), standardization still applies. For categorical features, use appropriate encoding (one-hot or effect coding) and standardize the resulting columns.
Target Encoding: The target variable $y$ should be 0/1 (not -1/+1 or other encodings) for standard log-loss formulation.
L2 regularization interacts with class imbalance in important ways:
Intercept Bias: With unbalanced classes (e.g., 1% positives), the fitted intercept captures the prior class probability. The intercept should not be regularized to preserve this calibration.
Weighted Samples: For highly imbalanced problems, weight the log-likelihood contributions:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} w_i \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \frac{\lambda}{2} |\boldsymbol{\beta}|_2^2$$
Common weighting schemes:
- Inverse class-frequency weighting, e.g. `class_weight='balanced'` in scikit-learn (see the sketch below)
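As a sketch of the inverse-frequency idea, the helper below assigns each sample the weight $n / (K \cdot n_k)$ for its class $k$, which is the heuristic scikit-learn's `class_weight='balanced'` implements.

```python
import numpy as np

def balanced_sample_weights(y):
    """Weight each sample by n / (n_classes * n_k), where n_k is its class count.
    This mirrors scikit-learn's class_weight='balanced' heuristic."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    return class_weight[np.searchsorted(classes, y)]

# Example: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
w = balanced_sample_weights(y)
print("weight for class 0:", w[0])    # 100 / (2 * 90) ≈ 0.556
print("weight for class 1:", w[-1])   # 100 / (2 * 10) = 5.0
print("total weight:", w.sum())       # equals n = 100
```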
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np


def create_l2_logistic_pipeline(C=1.0, class_weight=None):
    """
    Create a complete pipeline for L2-regularized logistic regression.

    Parameters
    ----------
    C : float
        Inverse regularization strength (1/lambda in our notation)
    class_weight : dict, 'balanced', or None
        Weights for handling class imbalance

    Returns
    -------
    pipeline : sklearn.Pipeline
        Complete preprocessing + model pipeline
    """
    return Pipeline([
        ('scaler', StandardScaler()),  # Critical: standardize features
        ('classifier', LogisticRegression(
            penalty='l2',
            C=C,
            class_weight=class_weight,
            solver='lbfgs',
            max_iter=1000,
            random_state=42
        ))
    ])


def compare_regularization_strengths(X, y, C_values):
    """
    Compare model performance across regularization strengths.

    Demonstrates the bias-variance tradeoff as C changes.
    """
    results = []

    for C in C_values:
        pipeline = create_l2_logistic_pipeline(C=C)

        # Cross-validation scores
        scores = cross_val_score(
            pipeline, X, y,
            cv=5,
            scoring='roc_auc'
        )

        # Fit on full data to examine coefficients
        pipeline.fit(X, y)
        coef = pipeline.named_steps['classifier'].coef_[0]

        results.append({
            'C': C,
            'lambda': 1 / C,
            'mean_auc': scores.mean(),
            'std_auc': scores.std(),
            'coef_l2_norm': np.linalg.norm(coef),
            'max_abs_coef': np.max(np.abs(coef)),
            'n_near_zero': np.sum(np.abs(coef) < 0.01)
        })

    return results


# Example: Demonstrating the regularization path
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate classification data
    X, y = make_classification(
        n_samples=500,
        n_features=50,
        n_informative=10,
        n_redundant=20,
        n_clusters_per_class=2,
        random_state=42
    )

    # Test range of regularization strengths
    C_values = [100, 10, 1, 0.1, 0.01, 0.001]
    results = compare_regularization_strengths(X, y, C_values)

    print("\nL2 Regularization Comparison")
    print("=" * 80)
    print(f"{'C':>8} {'λ':>8} {'AUC':>12} {'||β||₂':>10} {'max|β|':>10}")
    print("-" * 80)
    for r in results:
        print(f"{r['C']:>8.3f} {r['lambda']:>8.3f} "
              f"{r['mean_auc']:.4f}±{r['std_auc']:.4f} "
              f"{r['coef_l2_norm']:>10.4f} {r['max_abs_coef']:>10.4f}")
```

L2-regularized coefficients require careful interpretation:
Shrinkage Adjustment: Coefficients are systematically shrunk toward zero. Their magnitudes are biased estimates of the true effect sizes. Use them for prediction and ranking feature importance, but not for unbiased effect size estimation.
Relative Importance: Larger |β_j| indicates greater predictive importance, but the absolute scale depends on λ. Compare coefficients within a model, not across models with different λ.
Odds Ratio Interpretation: For standardized features, $e^{\beta_j}$ gives the multiplicative change in odds for a one-standard-deviation increase in feature $j$. This interpretation is approximate due to regularization bias.
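A tiny illustration (scikit-learn on synthetic, standardized data): exponentiating the fitted coefficients yields per-standard-deviation odds multipliers, bearing in mind that these are shrunken estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
Xs = StandardScaler().fit_transform(X)
coef = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(Xs, y).coef_[0]

for j, (b, odds) in enumerate(zip(coef, np.exp(coef)), start=1):
    print(f"x{j}: beta = {b:+.3f}  ->  odds multiply by ~{odds:.2f} per +1 SD (shrunken estimate)")
```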
L2 regularization improves prediction at the cost of interpretability. If precise coefficient estimation and hypothesis testing are primary goals, consider using unregularized logistic regression on a carefully selected feature set, or penalized regression methods designed for inference (like the de-biased lasso).
L2 regularization transforms logistic regression from a potentially unstable estimator into a robust, reliable classifier. We've covered the complete theory from multiple perspectives.
The next page explores L1 regularization (Lasso) for logistic regression, which differs fundamentally in geometry and behavior. While L2 shrinks coefficients smoothly, L1 can drive coefficients exactly to zero—enabling automatic feature selection. Understanding both prepares you for Elastic Net, which combines their strengths.