Logistic regression, despite its conceptual elegance and interpretability, faces a fundamental challenge that becomes critical as model complexity increases: overfitting. When we fit a logistic regression model to data—particularly in high-dimensional settings where the number of features approaches or exceeds the number of observations—the maximum likelihood estimator can exhibit pathological behavior that renders the model useless for prediction.
L2 regularization (also known as Ridge regularization or Tikhonov regularization in the broader numerical analysis context) provides a principled solution to this problem by introducing a penalty term that constrains the magnitude of the model coefficients. This page develops the complete theory of L2-regularized logistic regression, from mathematical foundations to practical implementation.
This page assumes familiarity with standard logistic regression, maximum likelihood estimation, and basic convex optimization. We build upon the logistic regression model and MLE concepts from earlier modules, extending them with regularization machinery.
Before diving into the solution, we must thoroughly understand the problem. Overfitting in logistic regression manifests differently than in linear regression, and understanding these nuances is essential for appreciating why regularization is necessary.
In standard logistic regression, we maximize the log-likelihood:
$$\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right]$$
where $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ and $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function.
The Perfect Separation Problem: When data is linearly separable, the MLE does not exist in the finite parameter space. The optimization algorithm will drive coefficients toward infinity, attempting to create a decision boundary that perfectly separates the classes.
Complete separation occurs when a hyperplane can perfectly divide the two classes with no overlap. Quasi-complete separation occurs when there exists at least one observation on the decision boundary. In both cases, the MLE is undefined or lies at infinity, causing numerical instability and infinite standard errors for affected coefficients.
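To make this concrete, here is a small numerical sketch (plain NumPy, on a made-up one-dimensional separable dataset): the unpenalized log-likelihood keeps improving as the coefficient grows without bound, while an L2-penalized objective peaks at a finite value.

```python
import numpy as np
from scipy.special import expit

# Toy 1-D dataset that is perfectly separable: x < 0 -> class 0, x > 0 -> class 1
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(beta):
    p = np.clip(expit(beta * x), 1e-15, 1 - 1e-15)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def penalized_log_likelihood(beta, lam=1.0):
    return log_likelihood(beta) - 0.5 * lam * beta ** 2

for beta in [1.0, 2.0, 5.0, 10.0, 50.0]:
    print(f"beta={beta:5.1f}  log-lik={log_likelihood(beta):10.4f}  "
          f"penalized={penalized_log_likelihood(beta):10.4f}")
# The log-likelihood creeps toward 0 (its supremum) only as beta -> infinity,
# so no finite MLE exists; the penalized objective attains its maximum at a finite beta.
```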
When the number of features $p$ is large relative to the sample size $n$, several problems emerge:
Variance Explosion: Even when the MLE exists, the estimator variance becomes extremely large. Small changes in training data cause dramatic changes in the estimated coefficients, making the model unreliable.
Multicollinearity: Correlated features create near-singular design matrices, amplifying coefficient instability. A small amount of noise can completely change which correlated feature gets the "credit" for predictive power.
Spurious Correlations: With many features, some will be spuriously correlated with the outcome by chance alone. The MLE will exploit these spurious correlations, fitting noise rather than signal.
| Scenario | Symptom | Consequence |
|---|---|---|
| Perfect separation | Coefficients → ±∞ | Non-convergence, undefined standard errors |
| High p/n ratio | Extreme coefficient values | Poor generalization, high variance |
| Multicollinearity | Unstable coefficient signs | Uninterpretable model, sensitivity to data |
| Small sample size | Overconfident predictions | Predicted probabilities near 0 or 1 |
For classification, we care about the expected prediction error on new data. This error can be decomposed:
$$\text{EPE} = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
The unregularized MLE is asymptotically unbiased under correct model specification, but its variance can become effectively unbounded in the problematic settings described above. Regularization introduces a controlled amount of bias in exchange for a substantial variance reduction, often improving overall prediction accuracy.
This is not merely a theoretical concern—it's the fundamental justification for regularization. We deliberately move away from the "best" unbiased estimator because bias can be a worthwhile price for stability.
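As an illustration of this tradeoff, the sketch below (scikit-learn on a synthetic data-generating process chosen only for this example) refits a nearly unregularized and a strongly regularized model on many independent training sets and compares the spread and bias of the coefficient estimates.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p, n_train, n_reps = 10, 100, 100
true_beta = np.zeros(p)
true_beta[:3] = [1.0, -1.0, 0.5]   # only three informative features

def coef_over_replications(C):
    """Fit LogisticRegression(C=C) on n_reps fresh training sets; return all coefficients."""
    coefs = []
    for _ in range(n_reps):
        X = rng.standard_normal((n_train, p))
        y = (rng.random(n_train) < expit(X @ true_beta)).astype(int)
        coefs.append(LogisticRegression(C=C, max_iter=5000).fit(X, y).coef_[0])
    return np.array(coefs)

for C, label in [(1e4, "nearly unregularized"), (0.1, "strongly regularized")]:
    coefs = coef_over_replications(C)
    bias2 = np.mean((coefs.mean(axis=0) - true_beta) ** 2)
    variance = np.mean(coefs.var(axis=0))
    print(f"{label:22s}  avg squared bias = {bias2:.4f}   avg variance = {variance:.4f}")
# Regularization adds bias but cuts the variance, often lowering total error.
```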
L2 regularization modifies the maximum likelihood objective by adding a penalty term proportional to the squared L2 norm of the coefficient vector. The regularized objective becomes:
$$\mathcal{L}_{\lambda}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] - \frac{\lambda}{2} \|\boldsymbol{\beta}\|_2^2$$
or equivalently, we minimize the penalized negative log-likelihood:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \frac{\lambda}{2} \sum_{j=1}^{p} \beta_j^2$$
where $\lambda \geq 0$ is the regularization strength and $p_i = \sigma(\mathbf{x}_i^\top \boldsymbol{\beta})$ as before. Setting $\lambda = 0$ recovers the unregularized MLE; larger $\lambda$ imposes a stronger penalty on large coefficients.
By convention, the intercept term $\beta_0$ is typically not regularized. Penalizing the intercept would bias the model toward predicting equal class probabilities, which is inappropriate when classes are imbalanced. The penalty applies only to the coefficients $\beta_1, \ldots, \beta_p$ associated with features.
The penalized formulation is equivalent to a constrained optimization problem. For every value of $\lambda$, there exists a corresponding constraint $t$ such that:
$$\begin{aligned} \text{maximize} \quad & \mathcal{L}(\boldsymbol{\beta}) \\ \text{subject to} \quad & \|\boldsymbol{\beta}\|_2^2 \leq t \end{aligned}$$
This constrained formulation is often more intuitive: we're maximizing the likelihood subject to a "budget" constraint on how large coefficients can be. The relationship between $\lambda$ and $t$ is monotonic—larger $\lambda$ corresponds to smaller $t$ (tighter constraint).
Geometric Interpretation: The constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a ball centered at the origin (a disk in two dimensions, a hypersphere boundary in general). We seek the point in this ball that maximizes the log-likelihood; for the constraint to treat all coefficients symmetrically, the features should be on a common scale.
```python
import numpy as np
from scipy.special import expit  # Numerically stable sigmoid


def l2_regularized_logistic_loss(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the L2-regularized logistic regression loss.

    Parameters
    ----------
    beta : array-like, shape (p,) or (p+1,)
        Coefficient vector (includes intercept if fit_intercept=True)
    X : array-like, shape (n, p)
        Feature matrix
    y : array-like, shape (n,)
        Binary labels (0 or 1)
    lambda_reg : float
        Regularization strength (lambda >= 0)
    fit_intercept : bool
        Whether first element of beta is the intercept

    Returns
    -------
    loss : float
        The penalized negative log-likelihood
    """
    n = X.shape[0]

    # Separate intercept from feature coefficients
    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Compute probabilities using numerically stable sigmoid
    prob = expit(linear_pred)

    # Clip probabilities to avoid log(0)
    eps = 1e-15
    prob = np.clip(prob, eps, 1 - eps)

    # Negative log-likelihood (cross-entropy loss)
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

    # L2 penalty on feature coefficients (not intercept)
    l2_penalty = 0.5 * lambda_reg * np.sum(coef ** 2)

    return nll + l2_penalty


def l2_regularized_logistic_gradient(beta, X, y, lambda_reg, fit_intercept=True):
    """
    Compute the gradient of the L2-regularized logistic regression loss.

    The gradient combines the log-likelihood gradient with the
    L2 penalty gradient.
    """
    n = X.shape[0]

    if fit_intercept:
        intercept = beta[0]
        coef = beta[1:]
        linear_pred = X @ coef + intercept
    else:
        linear_pred = X @ beta
        coef = beta

    # Prediction residuals: (p_i - y_i)
    prob = expit(linear_pred)
    residuals = prob - y

    if fit_intercept:
        # Gradient w.r.t. intercept (no regularization)
        grad_intercept = np.sum(residuals)
        # Gradient w.r.t. coefficients (includes L2 penalty)
        grad_coef = X.T @ residuals + lambda_reg * coef
        return np.concatenate([[grad_intercept], grad_coef])
    else:
        return X.T @ residuals + lambda_reg * beta
```

Different software packages use different parameterizations of the regularization strength:
| Framework | Parameterization | Relationship to $\lambda$ |
|---|---|---|
| scikit-learn | $C = 1/\lambda$ | Inverse regularization strength |
| statsmodels | $\alpha = \lambda$ | Direct regularization strength |
| R (glmnet) | $\lambda$ with scaling | $\lambda_{\text{glmnet}} \cdot n = \lambda$ |
scikit-learn convention: Uses $C$ where larger values mean less regularization (confusing but historical). The objective becomes:
$$J(\boldsymbol{\beta}) = \frac{1}{2C}\|\boldsymbol{\beta}\|_2^2 + \sum_{i=1}^{n} \ell(y_i, \hat{y}_i)$$
(equivalently, scikit-learn multiplies the summed loss by $C$ and leaves the penalty as $\frac{1}{2}\|\boldsymbol{\beta}\|_2^2$).
Always verify the parameterization when using new software to ensure correct hyperparameter specification.
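As a sketch of the conversion (assuming the $\lambda$ used on this page, i.e. a $(\lambda/2)\|\boldsymbol{\beta}\|_2^2$ penalty added to the summed negative log-likelihood), scikit-learn's `C` is simply the reciprocal; note that exact equivalence also depends on whether a package sums or averages the per-sample losses.

```python
def lambda_to_sklearn_C(lam: float) -> float:
    """Map the lambda of this page's penalized objective to scikit-learn's C = 1/lambda."""
    if lam <= 0:
        raise ValueError("lambda must be positive; lambda = 0 (no regularization) "
                         "corresponds to a very large C instead.")
    return 1.0 / lam

# Example: lambda = 10 corresponds to LogisticRegression(C=0.1)
print(lambda_to_sklearn_C(10.0))   # 0.1
```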
L2 regularization implements shrinkage: it systematically pulls coefficient estimates toward zero. This isn't merely a computational trick but has profound statistical consequences.
Consider the gradient of the L2 penalty:
$$\frac{\partial}{\partial \beta_j} \left( \frac{\lambda}{2} \sum_k \beta_k^2 \right) = \lambda \beta_j$$
This gradient is proportional to the coefficient value itself. Large coefficients incur larger gradients pushing them toward zero, while small coefficients experience less penalty. The result is proportional shrinkage:
$$\hat{\beta}_j^{\text{ridge}} \approx \frac{1}{1 + \lambda/\text{curvature}} \cdot \hat{\beta}_j^{\text{MLE}}$$
where "curvature" refers to the second derivative of the log-likelihood. Coefficients are shrunk toward zero by a factor that depends on both $\lambda$ and how well-determined they are by the data.
Unlike L1 regularization (Lasso), L2 regularization never shrinks coefficients exactly to zero. The penalty gradient vanishes as $\beta_j \to 0$, so there's always a residual contribution. This means L2 regularization cannot perform feature selection—all features remain in the model with (typically) small but non-zero coefficients.
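A quick empirical check of this contrast (scikit-learn on synthetic data): as the penalty grows, the L2 coefficient norm shrinks steadily but no coefficient hits exactly zero, whereas an L1 penalty of comparable strength zeroes many of them outright.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)

for C in [10.0, 1.0, 0.1, 0.01]:          # remember: lambda is roughly 1/C
    l2 = LogisticRegression(penalty="l2", C=C, max_iter=2000).fit(X, y)
    l1 = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                            max_iter=2000).fit(X, y)
    print(f"C={C:6.2f}  L2: ||beta||={np.linalg.norm(l2.coef_):6.3f}, "
          f"exact zeros={np.sum(l2.coef_ == 0)}   "
          f"L1: exact zeros={np.sum(l1.coef_ == 0)}")
```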
The geometry of L2 regularization illuminates why it works:
Likelihood Contours: In the two-coefficient case, the log-likelihood defines contour lines (curves of constant likelihood) in the $(\beta_1, \beta_2)$ plane. The MLE sits at the center of these concentric ellipses.
Constraint Region: The L2 constraint $\beta_1^2 + \beta_2^2 \leq t$ defines a circular disk centered at the origin.
Solution Location: The regularized solution occurs where the highest-valued likelihood contour touches the constraint disk. Due to the circular shape, this tangent point typically has both coefficients non-zero but smaller in magnitude than the MLE.
L2 regularization handles correlated features gracefully through coefficient sharing:
When two features $x_1$ and $x_2$ are highly correlated and both predictive of $y$, L2 regularization distributes the weight roughly equally between them instead of arbitrarily loading it onto one, as the unregularized MLE tends to do.
This behavior arises because the L2 penalty is minimized when equal total weight is spread across correlated features rather than concentrated in one. If $\beta_1 + \beta_2 = c$ is required for good fit, then $\beta_1 = \beta_2 = c/2$ minimizes $\beta_1^2 + \beta_2^2$.
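The sketch below (a synthetic example with two nearly duplicate features, constructed for this illustration) shows the effect directly: the fitted L2 model gives the correlated pair nearly equal coefficients and leaves the unrelated feature near zero.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
z = rng.standard_normal(n)
x1 = z + 0.01 * rng.standard_normal(n)     # x1 and x2 are almost perfectly correlated
x2 = z + 0.01 * rng.standard_normal(n)
x3 = rng.standard_normal(n)                # an unrelated feature
X = np.column_stack([x1, x2, x3])
y = (rng.random(n) < expit(2.0 * z)).astype(int)   # the signal comes through z

model = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)
print("correlation(x1, x2) =", np.corrcoef(x1, x2)[0, 1].round(4))
print("coefficients:", model.coef_[0].round(3))
# Expect beta_1 and beta_2 to be roughly equal (each carrying about half the
# total weight for z) and beta_3 to be near zero.
```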
| Property | Unregularized MLE | L2 Regularized |
|---|---|---|
| Coefficient magnitude | Can be arbitrarily large | Bounded by regularization strength |
| Correlated features | Unstable, arbitrary assignment | Weights distributed evenly |
| Feature selection | Implicit via p-values | All features retained |
| Perfect separation | Coefficients diverge | Finite, stable solution |
| Variance | Can be very high | Substantially reduced |
| Bias | Zero (asymptotically) | Small, controlled bias |
The L2-regularized logistic regression objective is strictly convex when $\lambda > 0$. This has profound computational implications:
Original Log-Likelihood: The negative log-likelihood of logistic regression is convex (equivalently, the log-likelihood is concave), but not strictly convex: the Hessian can be singular when features are collinear.
Adding L2 Penalty: The L2 penalty $\frac{\lambda}{2}|\boldsymbol{\beta}|_2^2$ is strictly convex with Hessian $\lambda \mathbf{I}$. Adding it to any convex function yields a strictly convex function.
Consequence: The regularized objective has a unique global minimum. Any local minimum is the global minimum, and gradient-based methods are guaranteed to converge to it.
For efficient optimization, we need the gradient and Hessian of the regularized objective.
Gradient (for minimization):
$$\nabla J(\boldsymbol{\beta}) = \sum_{i=1}^{n} (p_i - y_i) \mathbf{x}_i + \lambda \boldsymbol{\beta}$$
Or in matrix form:
$$\nabla J(\boldsymbol{\beta}) = \mathbf{X}^\top (\mathbf{p} - \mathbf{y}) + \lambda \boldsymbol{\beta}$$
where $\mathbf{p} = [p_1, \ldots, p_n]^\top$ is the vector of predicted probabilities.
Hessian:
$$\mathbf{H} = \nabla^2 J(\boldsymbol{\beta}) = \mathbf{X}^\top \mathbf{W} \mathbf{X} + \lambda \mathbf{I}$$
where $\mathbf{W} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n))$ is a diagonal matrix of variance weights.
The regularization term $\lambda \mathbf{I}$ ensures the Hessian is always positive definite, guaranteeing well-conditioned optimization.
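A minimal numerical check of this claim (NumPy, with a deliberately collinear design matrix made up for the example): the unpenalized Hessian $\mathbf{X}^\top \mathbf{W} \mathbf{X}$ has a zero eigenvalue, while adding $\lambda \mathbf{I}$ lifts every eigenvalue to at least $\lambda$.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2.0 * x1, rng.standard_normal(n)])  # column 2 = 2 * column 1

beta = np.array([0.5, -0.2, 1.0])
p = expit(X @ beta)
W = np.diag(p * (1 - p))

H = X.T @ W @ X                  # unpenalized Hessian: singular due to collinearity
lam = 0.5
H_reg = H + lam * np.eye(3)      # regularized Hessian

print("min eigenvalue of X'WX        :", np.linalg.eigvalsh(H).min())
print("min eigenvalue of X'WX + lam*I:", np.linalg.eigvalsh(H_reg).min())
# The first is (numerically) zero; the second is bounded below by lambda = 0.5.
```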
```python
import numpy as np
from scipy.special import expit


class L2RegularizedLogisticRegression:
    """
    L2-Regularized Logistic Regression using Newton-Raphson (IRLS).

    This implementation demonstrates the IRLS algorithm with L2
    regularization, which is equivalent to Newton's method for
    logistic regression.
    """

    def __init__(self, lambda_reg=1.0, max_iter=100, tol=1e-6):
        self.lambda_reg = lambda_reg
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = 0

    def fit(self, X, y):
        """
        Fit the L2-regularized logistic regression model using IRLS.

        The IRLS update is:
            beta_new = (X'WX + lambda*I)^{-1} * X'W * z
        where:
            - W = diag(p_i * (1 - p_i)) is the weight matrix
            - z = X*beta + W^{-1}(y - p) is the working response
        """
        n, p = X.shape

        # Add intercept column
        X_aug = np.column_stack([np.ones(n), X])

        # Initialize coefficients at zero
        beta = np.zeros(p + 1)

        # Regularization matrix (don't penalize intercept)
        reg_matrix = self.lambda_reg * np.eye(p + 1)
        reg_matrix[0, 0] = 0  # No regularization on intercept

        for iteration in range(self.max_iter):
            # Compute predicted probabilities
            linear_pred = X_aug @ beta
            p_pred = expit(linear_pred)

            # Compute weights (variance of binomial)
            weights = p_pred * (1 - p_pred)
            weights = np.maximum(weights, 1e-10)  # Numerical stability

            # Weight matrix
            W = np.diag(weights)

            # Working response (adjusted dependent variable)
            z = linear_pred + (y - p_pred) / weights

            # Compute the regularized Hessian
            H_reg = X_aug.T @ W @ X_aug + reg_matrix

            # Compute the right-hand side
            rhs = X_aug.T @ W @ z

            # Solve the normal equations
            beta_new = np.linalg.solve(H_reg, rhs)

            # Check for convergence
            diff = np.max(np.abs(beta_new - beta))
            beta = beta_new

            if diff < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        # Store results
        self.intercept_ = beta[0]
        self.coef_ = beta[1:]

        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_pred = X @ self.coef_ + self.intercept_
        prob_1 = expit(linear_pred)
        return np.column_stack([1 - prob_1, prob_1])

    def predict(self, X, threshold=0.5):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# Demonstration of IRLS convergence
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sample data
    n, p = 200, 10
    X = np.random.randn(n, p)
    true_coef = np.array([1.5, -1.0, 0.5, 0, 0, 0, 0, 0, 0, 0])
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(int)

    # Fit models with different regularization strengths
    for lam in [0.01, 0.1, 1.0, 10.0]:
        model = L2RegularizedLogisticRegression(lambda_reg=lam)
        model.fit(X, y)
        print(f"lambda={lam:5.2f}: coef magnitude = {np.linalg.norm(model.coef_):.4f}, "
              f"iterations = {model.n_iter_}")
```

Several algorithms efficiently solve the L2-regularized problem:
Newton-Raphson / IRLS: The Iteratively Reweighted Least Squares algorithm is Newton's method applied to logistic regression. With L2 regularization, the update becomes:
$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left( \mathbf{X}^\top \mathbf{W} \mathbf{X} + \lambda \mathbf{I} \right)^{-1} \nabla J(\boldsymbol{\beta}^{(t)})$$
This converges quadratically near the optimum.
L-BFGS: A quasi-Newton method that approximates the Hessian using gradient history. Efficient for medium-scale problems where the Hessian is expensive to compute or store.
Stochastic Gradient Descent (SGD): For massive datasets, SGD updates parameters using gradient estimates from mini-batches. The L2 penalty gradient is trivially incorporated as $\lambda \boldsymbol{\beta}$.
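Below is a minimal mini-batch SGD sketch in plain NumPy (the learning rate, batch size, and epoch count are arbitrary illustrative choices): the L2 term simply adds $\lambda \boldsymbol{\beta}$ to each gradient estimate, which is the familiar weight-decay update, with the intercept left unpenalized.

```python
import numpy as np
from scipy.special import expit

def sgd_l2_logistic(X, y, lam=1.0, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD for L2-regularized logistic regression (intercept unpenalized).
    Gradients are scaled to the full-data objective: summed losses + (lam/2)*||beta||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
            resid = expit(X[idx] @ beta + intercept) - y[idx]      # p_i - y_i on the batch
            scale = n / len(idx)                                   # rescale batch sum to full sum
            grad_beta = scale * X[idx].T @ resid + lam * beta      # L2 penalty acts as weight decay
            grad_intercept = scale * resid.sum()                   # intercept: no decay
            beta -= lr / n * grad_beta
            intercept -= lr / n * grad_intercept
    return intercept, beta

# Quick smoke test on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = (rng.random(500) < expit(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]))).astype(int)
print(sgd_l2_logistic(X, y, lam=1.0))
```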
L2 regularization is scale-dependent. If feature A is measured in meters and feature B in millimeters, the coefficient for B will be 1000x smaller and experience 1,000,000x less penalty. Always standardize features (zero mean, unit variance) before applying L2 regularization. This ensures all coefficients are penalized on a fair, comparable scale.
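A small sketch of this warning in action (synthetic data; the unit conversion is the only change between fits): expressing the same feature in millimeters instead of meters shrinks its raw coefficient by a factor of roughly 1000, so the quadratic penalty barely touches it and the two fits are no longer equivalent.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
x_m = rng.normal(loc=0.0, scale=2.0, size=n)          # a length measured in meters
y = (rng.random(n) < expit(1.5 * x_m)).astype(int)

X_meters = x_m.reshape(-1, 1)
X_millimeters = (1000.0 * x_m).reshape(-1, 1)          # same feature, different units

for name, X, unit_scale in [("meters", X_meters, 1.0),
                            ("millimeters", X_millimeters, 1000.0)]:
    coef = LogisticRegression(penalty="l2", C=0.001, max_iter=2000).fit(X, y).coef_[0, 0]
    print(f"{name:12s} raw coef = {coef: .6f}   implied effect per meter = {coef * unit_scale: .4f}")
# With strong regularization (small C), the meter version is shrunk heavily while the
# millimeter version is barely shrunk at all, so the two fits disagree.
```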
L2 regularization admits a compelling Bayesian interpretation: it corresponds to Maximum A Posteriori (MAP) estimation with a Gaussian prior on the coefficients.
Setup: Place an independent Gaussian prior on the coefficients, $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$, and keep the usual Bernoulli likelihood $p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta})$ of the logistic regression model.
MAP Objective: We maximize the posterior probability:
$$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} \, p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) = \arg\max_{\boldsymbol{\beta}} \, p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
Taking the log:
$$\log p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto \underbrace{\sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]}_{\text{Log-likelihood}} - \underbrace{\frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2}_{\text{Log-prior}}$$
Setting $\lambda = 1/\tau^2$ recovers the L2-penalized objective exactly.
The regularization parameter $\lambda = 1/\tau^2$ controls our prior beliefs. Large $\lambda$ (small $\tau^2$) means we have a strong prior belief that coefficients should be near zero—the prior is "tight." Small $\lambda$ (large $\tau^2$) means we have weak prior beliefs and let the data dominate—the prior is "diffuse."
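A brief numerical sketch of this equivalence (using `scipy.optimize.minimize` on synthetic data; the value of $\tau^2$ is an arbitrary choice): minimizing the negative log-posterior under a $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ prior and minimizing the L2-penalized negative log-likelihood with $\lambda = 1/\tau^2$ return the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
y = (rng.random(n) < expit(X @ np.array([1.0, -0.5, 0.0]))).astype(int)

def nll(beta):
    prob = np.clip(expit(X @ beta), 1e-15, 1 - 1e-15)
    return -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

tau2 = 0.5                                    # prior variance
lam = 1.0 / tau2                              # corresponding ridge penalty

neg_log_posterior = lambda b: nll(b) + 0.5 / tau2 * np.sum(b ** 2)   # Gaussian prior term
ridge_objective   = lambda b: nll(b) + 0.5 * lam * np.sum(b ** 2)    # L2 penalty term

beta_map   = minimize(neg_log_posterior, np.zeros(p)).x
beta_ridge = minimize(ridge_objective, np.zeros(p)).x
print("MAP estimate  :", beta_map.round(6))
print("Ridge estimate:", beta_ridge.round(6))   # identical up to optimizer tolerance
```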
The Gaussian prior has several important properties that explain L2 regularization behavior:
Continuous Density at Zero: Unlike the Laplace prior (L1), the Gaussian prior has a smooth, continuous density at zero with zero derivative. This explains why L2 regularization shrinks coefficients smoothly toward zero without ever reaching exactly zero.
Quadratic Penalization: The log of a Gaussian is a quadratic function. This ensures the penalty is differentiable everywhere, enabling efficient gradient-based optimization.
Independence Assumption: The isotropic prior $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ assumes coefficients are independent a priori. This is a simplifying assumption—in practice, domain knowledge might suggest correlated priors, leading to more sophisticated regularizers.
While we typically compute only the MAP estimate, the full Bayesian approach would compute the posterior distribution:
$$p(\boldsymbol{\beta} | \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
This posterior is not analytically tractable for logistic regression (unlike linear regression with conjugate priors). However, we can approximate it using the Laplace approximation (a Gaussian centered at the MAP estimate, as implemented below), Markov chain Monte Carlo sampling, or variational inference.
These approaches provide uncertainty quantification—not just point estimates but credible intervals for coefficients and predictions.
```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm


def laplace_approximation(X, y, beta_map, lambda_reg):
    """
    Compute the Laplace approximation to the posterior.

    The posterior is approximated as:
        p(beta | y, X) ≈ N(beta_map, H^{-1})
    where H is the Hessian of the negative log-posterior at beta_map.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (without intercept column)
    y : array, shape (n,)
        Binary labels
    beta_map : array, shape (p+1,)
        MAP estimate (includes intercept)
    lambda_reg : float
        Regularization strength

    Returns
    -------
    beta_map : array
        Mean of approximate posterior
    cov_matrix : array, shape (p+1, p+1)
        Covariance of approximate posterior
    std_errors : array, shape (p+1,)
        Standard errors (sqrt of diagonal of covariance)
    """
    n, p = X.shape
    X_aug = np.column_stack([np.ones(n), X])

    # Compute predicted probabilities at MAP estimate
    linear_pred = X_aug @ beta_map
    p_pred = expit(linear_pred)

    # Weights for IRLS
    weights = p_pred * (1 - p_pred)
    W = np.diag(weights)

    # Hessian of negative log-likelihood
    H_nll = X_aug.T @ W @ X_aug

    # Hessian of negative log-prior (regularization)
    H_prior = lambda_reg * np.eye(p + 1)
    H_prior[0, 0] = 0  # No regularization on intercept

    # Total Hessian (negative log-posterior)
    H_total = H_nll + H_prior

    # Covariance is inverse Hessian
    cov_matrix = np.linalg.inv(H_total)
    std_errors = np.sqrt(np.diag(cov_matrix))

    return beta_map, cov_matrix, std_errors


def compute_credible_intervals(beta_map, std_errors, alpha=0.05):
    """
    Compute credible intervals for coefficients.

    Parameters
    ----------
    beta_map : array
        MAP estimates
    std_errors : array
        Standard errors from Laplace approximation
    alpha : float
        Significance level (default 0.05 for 95% intervals)

    Returns
    -------
    intervals : array, shape (n_coef, 2)
        Lower and upper bounds of credible intervals
    """
    z = norm.ppf(1 - alpha / 2)
    lower = beta_map - z * std_errors
    upper = beta_map + z * std_errors
    return np.column_stack([lower, upper])


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data
    n, p = 100, 5
    X = np.random.randn(n, p)
    true_beta = np.array([0.5, 1.0, -0.5, 0.0, 0.0])
    prob = expit(X @ true_beta)
    y = (np.random.rand(n) < prob).astype(int)

    # Fit L2-regularized model (using a generic optimizer for the demo)
    from scipy.optimize import minimize

    def neg_log_posterior(beta, X, y, lam):
        X_aug = np.column_stack([np.ones(len(y)), X])
        linear_pred = X_aug @ beta
        p = expit(linear_pred)
        p = np.clip(p, 1e-15, 1 - 1e-15)
        nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        penalty = 0.5 * lam * np.sum(beta[1:] ** 2)  # Don't penalize intercept
        return nll + penalty

    lambda_reg = 1.0
    beta_init = np.zeros(p + 1)
    result = minimize(neg_log_posterior, beta_init, args=(X, y, lambda_reg))
    beta_map = result.x

    # Laplace approximation
    _, cov, se = laplace_approximation(X, y, beta_map, lambda_reg)
    intervals = compute_credible_intervals(beta_map, se)

    print("\nBayesian Logistic Regression with L2 Prior")
    print("-" * 60)
    print(f"{'Param':<8} {'Estimate':>10} {'Std Err':>10} {'95% CI':>20}")
    print("-" * 60)
    for i, name in enumerate(["Intercept"] + [f"x{j+1}" for j in range(p)]):
        print(f"{name:<8} {beta_map[i]:>10.4f} {se[i]:>10.4f} "
              f"[{intervals[i, 0]:>7.4f}, {intervals[i, 1]:>7.4f}]")
```

Feature Standardization: As emphasized earlier, L2 regularization requires standardized features. The standard approach:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of feature $j$.
Binary and Categorical Features: For binary features (0/1), standardization still applies. For categorical features, use appropriate encoding (one-hot or effect coding) and standardize the resulting columns.
Target Encoding: The target variable $y$ should be 0/1 (not -1/+1 or other encodings) for standard log-loss formulation.
L2 regularization interacts with class imbalance in important ways:
Intercept Bias: With unbalanced classes (e.g., 1% positives), the fitted intercept captures the prior class probability. The intercept should not be regularized to preserve this calibration.
Weighted Samples: For highly imbalanced problems, weight the log-likelihood contributions:
$$J(\boldsymbol{\beta}) = -\sum_{i=1}^{n} w_i \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right] + \frac{\lambda}{2} |\boldsymbol{\beta}|_2^2$$
Common weighting schemes:
- Inverse class-frequency weighting, e.g. `class_weight='balanced'` in scikit-learn (see the sketch below)
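As a sketch of the inverse-frequency idea, the helper below assigns each sample the weight $n / (K \cdot n_k)$ for its class $k$, which is the heuristic scikit-learn's `class_weight='balanced'` implements.

```python
import numpy as np

def balanced_sample_weights(y):
    """Weight each sample by n / (n_classes * n_k), where n_k is its class count.
    This mirrors scikit-learn's class_weight='balanced' heuristic."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    return class_weight[np.searchsorted(classes, y)]

# Example: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
w = balanced_sample_weights(y)
print("weight for class 0:", w[0])    # 100 / (2 * 90) ≈ 0.556
print("weight for class 1:", w[-1])   # 100 / (2 * 10) = 5.0
print("total weight:", w.sum())       # equals n = 100
```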
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np


def create_l2_logistic_pipeline(C=1.0, class_weight=None):
    """
    Create a complete pipeline for L2-regularized logistic regression.

    Parameters
    ----------
    C : float
        Inverse regularization strength (1/lambda in our notation)
    class_weight : dict, 'balanced', or None
        Weights for handling class imbalance

    Returns
    -------
    pipeline : sklearn.Pipeline
        Complete preprocessing + model pipeline
    """
    return Pipeline([
        ('scaler', StandardScaler()),  # Critical: standardize features
        ('classifier', LogisticRegression(
            penalty='l2',
            C=C,
            class_weight=class_weight,
            solver='lbfgs',
            max_iter=1000,
            random_state=42
        ))
    ])


def compare_regularization_strengths(X, y, C_values):
    """
    Compare model performance across regularization strengths.

    Demonstrates the bias-variance tradeoff as C changes.
    """
    results = []

    for C in C_values:
        pipeline = create_l2_logistic_pipeline(C=C)

        # Cross-validation scores
        scores = cross_val_score(
            pipeline, X, y,
            cv=5,
            scoring='roc_auc'
        )

        # Fit on full data to examine coefficients
        pipeline.fit(X, y)
        coef = pipeline.named_steps['classifier'].coef_[0]

        results.append({
            'C': C,
            'lambda': 1 / C,
            'mean_auc': scores.mean(),
            'std_auc': scores.std(),
            'coef_l2_norm': np.linalg.norm(coef),
            'max_abs_coef': np.max(np.abs(coef)),
            'n_near_zero': np.sum(np.abs(coef) < 0.01)
        })

    return results


# Example: Demonstrating the regularization path
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate classification data
    X, y = make_classification(
        n_samples=500,
        n_features=50,
        n_informative=10,
        n_redundant=20,
        n_clusters_per_class=2,
        random_state=42
    )

    # Test range of regularization strengths
    C_values = [100, 10, 1, 0.1, 0.01, 0.001]
    results = compare_regularization_strengths(X, y, C_values)

    print("\nL2 Regularization Comparison")
    print("=" * 80)
    print(f"{'C':>8} {'λ':>8} {'AUC':>12} {'||β||₂':>10} {'max|β|':>10}")
    print("-" * 80)
    for r in results:
        print(f"{r['C']:>8.3f} {r['lambda']:>8.3f} "
              f"{r['mean_auc']:.4f}±{r['std_auc']:.4f} "
              f"{r['coef_l2_norm']:>10.4f} {r['max_abs_coef']:>10.4f}")
```

L2-regularized coefficients require careful interpretation:
Shrinkage Adjustment: Coefficients are systematically shrunk toward zero. Their magnitudes are biased estimates of the true effect sizes. Use them for prediction and ranking feature importance, but not for unbiased effect size estimation.
Relative Importance: Larger |β_j| indicates greater predictive importance, but the absolute scale depends on λ. Compare coefficients within a model, not across models with different λ.
Odds Ratio Interpretation: For standardized features, $e^{\beta_j}$ gives the multiplicative change in odds for a one-standard-deviation increase in feature $j$. This interpretation is approximate due to regularization bias.
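A tiny illustration (scikit-learn on synthetic, standardized data): exponentiating the fitted coefficients yields per-standard-deviation odds multipliers, bearing in mind that these are shrunken estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
Xs = StandardScaler().fit_transform(X)
coef = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(Xs, y).coef_[0]

for j, (b, odds) in enumerate(zip(coef, np.exp(coef)), start=1):
    print(f"x{j}: beta = {b:+.3f}  ->  odds multiply by ~{odds:.2f} per +1 SD (shrunken estimate)")
```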
L2 regularization improves prediction at the cost of interpretability. If precise coefficient estimation and hypothesis testing are primary goals, consider using unregularized logistic regression on a carefully selected feature set, or penalized regression methods designed for inference (like the de-biased lasso).
L2 regularization transforms logistic regression from a potentially unstable estimator into a robust, reliable classifier. We've covered the complete theory from multiple perspectives.
The next page explores L1 regularization (Lasso) for logistic regression, which differs fundamentally in geometry and behavior. While L2 shrinks coefficients smoothly, L1 can drive coefficients exactly to zero—enabling automatic feature selection. Understanding both prepares you for Elastic Net, which combines their strengths.