In the previous module, we confronted the harsh reality of overfitting—the phenomenon where our models fit the training data too well, capturing noise rather than signal, and consequently failing catastrophically on unseen data. We saw how high-dimensional settings amplify this problem, leading to exploding variance and unreliable predictions.
The question now becomes: how do we tame this variance explosion?
The answer lies in regularization, a family of techniques that constrain our model's flexibility to achieve better generalization. Among regularization methods, Ridge regression (L2 regularization) stands as one of the most elegant, theoretically well-understood, and practically effective approaches. It represents the foundation upon which more sophisticated regularization techniques are built.
By the end of this page, you will understand the L2 penalty from multiple perspectives: the optimization objective, the geometric intuition, the probabilistic interpretation, and the historical context. You will be able to write the Ridge regression objective function and explain exactly why it works.
Before diving into the L2 penalty, let's crystallize exactly what problem we're solving. Recall the Ordinary Least Squares (OLS) objective:
$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$$
This minimizes the sum of squared residuals, finding the coefficient vector $\boldsymbol{\beta}$ that makes our predictions $\mathbf{X}\boldsymbol{\beta}$ as close as possible to the observed targets $\mathbf{y}$.
When does OLS fail? OLS becomes problematic in several scenarios:

- **Multicollinearity:** highly correlated predictors make $\mathbf{X}^T\mathbf{X}$ nearly singular, so coefficient estimates become unstable and their variance explodes.
- **High dimensionality:** when the number of features $p$ approaches or exceeds the number of observations $n$, the solution is poorly determined, or not unique at all.
- **Flexible feature expansions:** rich bases such as high-degree polynomials let the model chase noise, producing wildly oscillating fits.
A concrete example:
Consider fitting a polynomial of degree 15 to 20 data points. OLS will find coefficients that pass exactly through all points (or very close to them). But these coefficients might be enormous—values like $\beta_7 = 1,234,567$ and $\beta_8 = -1,234,589$ that largely cancel each other out.
This cancellation is a hallmark of overfitting: the model exploits tiny numerical differences rather than learning genuine patterns. When we encounter a new data point, these huge coefficients amplify any small deviation, producing wildly inaccurate predictions.
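To see this cancellation and its cure concretely, here is a small sketch (the synthetic sine data, the use of scikit-learn, and the specific degree and penalty strength are our illustrative assumptions, not from the text) comparing a degree-15 polynomial fit by OLS and by Ridge on the same 20 points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))                      # 20 data points
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=20)  # noisy target

# Expand to degree-15 polynomial features, then standardize them
features = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                         StandardScaler())
Z = features.fit_transform(X)

ols = LinearRegression().fit(Z, y)    # unregularized fit
ridge = Ridge(alpha=1.0).fit(Z, y)    # L2-penalized fit

# OLS coefficients on this near-collinear basis are typically enormous;
# Ridge keeps them modest
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

Whatever the exact numbers on your machine, the Ridge coefficient norm is provably no larger than the OLS norm for any positive penalty.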
The insight: Large coefficients are a symptom of overfitting. If we can discourage large coefficients, we can reduce overfitting. This is precisely what the L2 penalty accomplishes.
Regularization imposes a "complexity budget" on the model. By penalizing large coefficients, we force the model to find simpler solutions that rely on genuine patterns rather than noise. The L2 penalty specifically discourages large coefficient magnitudes while still allowing for necessary flexibility.
Ridge regression modifies the OLS objective by adding a penalty term that discourages large coefficient values. The L2 penalty (also called the squared Euclidean norm or Tikhonov regularization) measures the sum of squared coefficients:
$$\text{L2 penalty} = \|\boldsymbol{\beta}\|_2^2 = \sum_{j=1}^{p} \beta_j^2$$
The complete Ridge regression objective combines the OLS loss with this penalty:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$
Let's dissect each component of this formulation:
| Component | Mathematical Expression | Role |
|---|---|---|
| Loss term | $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$ | Measures fit to training data (sum of squared residuals) |
| Penalty term | $\|\boldsymbol{\beta}\|_2^2$ | Measures model complexity (sum of squared coefficients) |
| Regularization parameter | $\lambda \geq 0$ | Controls tradeoff between fit and complexity |
| Coefficients | $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ | Parameters to be estimated |
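The objective in the table above translates directly into a few lines of NumPy. This is a sketch for intuition (the function name `ridge_objective` is ours), not a fitting routine:

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Evaluate J(beta) = ||y - X beta||_2^2 + lam * ||beta||_2^2."""
    residuals = y - X @ beta
    loss = np.sum(residuals ** 2)       # squared-error loss term
    penalty = lam * np.sum(beta ** 2)   # L2 penalty term
    return loss + penalty

# Tiny hand-checkable example on a 2x2 identity design:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 3.0])
print(ridge_objective(np.zeros(2), X, y, lam=1.0))           # 13.0: all loss, no penalty
print(ridge_objective(np.array([2.0, 3.0]), X, y, lam=1.0))  # 13.0: no loss, all penalty
```

The two evaluations illustrate the tradeoff: at $\boldsymbol{\beta} = \mathbf{0}$ the penalty is zero but the loss is $2^2 + 3^2 = 13$; at the perfect-fit coefficients the loss is zero but the penalty is $13$.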
The regularization parameter $\lambda$:
The parameter $\lambda$ (often denoted $\alpha$ in software libraries) is the regularization strength, a hyperparameter. It controls how much we penalize large coefficients:

- $\lambda = 0$: the penalty vanishes and we recover the OLS solution.
- Small $\lambda$: mild shrinkage; the solution stays close to OLS.
- Large $\lambda$: strong shrinkage; coefficients are pulled heavily toward zero.
- $\lambda \to \infty$: the penalty dominates and $\boldsymbol{\beta} \to \mathbf{0}$.
The challenge of choosing the right $\lambda$ is a central topic we'll address later in this module.
By convention, the intercept $\beta_0$ is typically not penalized. This is because penalizing the intercept would pull predictions toward zero rather than toward the data mean. In practice, we center both the features and the response (subtract their means) so the intercept becomes zero, then apply Ridge to the remaining coefficients.
Expanded form:
Writing out the objective explicitly:
$$J(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
This is a quadratic function in $\boldsymbol{\beta}$—it's bowl-shaped with a unique minimum (as long as the function is strictly convex, which we'll verify). This mathematical property is crucial: it guarantees that Ridge regression has a unique, closed-form solution.
The Ridge objective can be expressed in two equivalent ways, connected by Lagrangian duality. Understanding both perspectives deepens our geometric intuition.
Penalized (Lagrangian) form:
$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$
Constrained form:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_2^2 \leq t$$
These two formulations are equivalent: for every $\lambda > 0$, there exists a $t > 0$ such that the solutions coincide, and vice versa. The relationship between $\lambda$ and $t$ is determined by the KKT conditions.
The constrained form has a beautiful geometric interpretation: we're finding the point in the ball $\|\boldsymbol{\beta}\|_2^2 \leq t$ that minimizes the squared error. As $t$ shrinks, we restrict ourselves to smaller balls, forcing coefficients toward zero. The constraint $t$ acts as a "budget" for coefficient magnitudes.
Why the L2 norm specifically?
The choice of the L2 (Euclidean) norm is not arbitrary. Consider properties we desire from a regularization penalty:

- **Convexity:** combined with the convex squared-error loss, the penalty should preserve a unique global minimum.
- **Smoothness:** a penalty that is differentiable everywhere keeps the optimization tractable, in closed form or by gradient methods.
- **Symmetry:** the penalty should treat all coefficient directions alike, favoring no feature a priori.
Alternative norms (L1, L∞, etc.) have different properties, leading to different regularization behaviors, which we'll explore in subsequent modules.
| Aspect | Penalized Form | Constrained Form |
|---|---|---|
| Objective | Minimize loss + penalty | Minimize loss only |
| Constraint | None (unconstrained optimization) | $\|\boldsymbol{\beta}\|_2^2 \leq t$ |
| Hyperparameter | $\lambda$ (penalty strength) | $t$ (constraint radius) |
| Relationship | $\lambda$ is Lagrange multiplier | $t$ determines feasible region |
| When $\lambda = 0$ | OLS solution | No constraint ($t = \infty$) |
| When $\lambda \to \infty$ | $\boldsymbol{\beta} \to \mathbf{0}$ | $t \to 0$ |
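The equivalence in the table above can be verified numerically. A hedged sketch (synthetic data; SciPy's general-purpose SLSQP solver): solve the penalized form in closed form via the normal equations, set the budget $t$ to the resulting squared norm, and check that the constrained problem lands on the same point:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Penalized form: solve the normal equations (X^T X + lam I) beta = X^T y
lam = 5.0
beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Constrained form: minimize RSS subject to ||beta||^2 <= t,
# with the budget t set to the penalized solution's squared norm
t = np.sum(beta_pen ** 2)

def rss(b):
    return np.sum((y - X @ b) ** 2)

ball = {"type": "ineq", "fun": lambda b: t - np.sum(b ** 2)}
beta_con = minimize(rss, x0=np.zeros(3), constraints=[ball]).x

# The two solutions should coincide up to solver tolerance
print(np.round(beta_pen, 4))
print(np.round(beta_con, 4))
```

The constraint is active at the optimum (the solution sits on the boundary $\|\boldsymbol{\beta}\|_2^2 = t$), which is exactly the KKT relationship between $\lambda$ and $t$ mentioned above.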
To truly understand Ridge regression, we must visualize what's happening in the coefficient space. Consider a simple case with two coefficients, $\beta_1$ and $\beta_2$.
The L2 constraint region:
The constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a ball (or disk in 2D) centered at the origin:
$$\beta_1^2 + \beta_2^2 \leq t$$
This is a circular region in 2D, a sphere in 3D, and a hypersphere in higher dimensions. The radius of this ball is $\sqrt{t}$.
The RSS contours:
The residual sum of squares (RSS) without regularization defines elliptical contours in coefficient space. The OLS solution $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ is the center of these ellipses—the point where RSS is minimized.
Finding the Ridge solution:
The Ridge solution is found where the RSS contour ellipse is tangent to the L2 constraint ball. This tangent point has several important properties:

- It is the point of lowest RSS among all points in the ball.
- Whenever the OLS solution lies outside the ball, the constraint is active: the solution sits exactly on the boundary $\|\boldsymbol{\beta}\|_2^2 = t$.
- At the tangent point, the gradient of the RSS is parallel to the gradient of the constraint, which is precisely the KKT stationarity condition linking $\lambda$ and $t$.
Geometric insight:
As we shrink the ball (decrease $t$ or increase $\lambda$), the tangent point moves toward the origin. This is the shrinkage effect—coefficients are pulled toward zero, reducing their magnitudes while still fitting the data as well as possible given the constraint.
The shrinkage interpretation reveals why Ridge regression combats overfitting. Large coefficients arise when the model exploits noise in the training data. By constraining coefficient magnitudes, we prevent this exploitation. The model must find patterns that work with smaller, more stable coefficients—patterns that tend to generalize better.
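The shrinkage effect is easy to observe empirically. A small sketch (synthetic data; scikit-learn's `Ridge`, whose regularization parameter is named `alpha`): as the penalty grows, the coefficient norm shrinks monotonically toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(size=100)

# Fit Ridge across increasingly strong regularization
norms = []
for alpha in [0.01, 1.0, 100.0, 10_000.0]:
    coef = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))

# Coefficient norm decreases as alpha grows: the tangent point
# moves toward the origin
print(norms)
```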
The smoothness of the L2 ball:
Unlike the L1 ball (which has corners—we'll see this with Lasso), the L2 ball is perfectly smooth. This smoothness has important implications:

- The tangent point can land anywhere on the boundary, so coefficients are shrunk toward zero but rarely hit exactly zero.
- The solution varies smoothly and continuously as $\lambda$ (or $t$) changes.
- The objective is differentiable everywhere, which is what makes a closed-form solution possible.
This is both a strength and weakness: Ridge is stable and well-behaved, but it doesn't perform automatic feature selection (all features remain in the model).
The L2 penalty we use in Ridge regression has a rich history extending beyond statistics into the theory of ill-posed problems in mathematics and physics.
Origins in inverse problems:
In the 1940s-1960s, mathematicians encountered "ill-posed" problems—equations that either have no solution, multiple solutions, or solutions that are extremely sensitive to input perturbations. These arise frequently in physics (heat conduction, wave propagation) and engineering (signal reconstruction, image processing).
Andrey Tikhonov (Russian mathematician, 1906-1993) developed a systematic approach called Tikhonov regularization in 1963. The core idea: when solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ where $\mathbf{A}$ is ill-conditioned, instead solve:
$$\min_{\mathbf{x}} \left\{ \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_2^2 \right\}$$
This is precisely the Ridge regression objective.
| Field | Name | Context |
|---|---|---|
| Statistics | Ridge regression | Regression with correlated predictors |
| Mathematics | Tikhonov regularization | Solving ill-posed inverse problems |
| Machine Learning | L2 regularization | Neural network weight decay |
| Numerical Analysis | Damped least squares | Stabilizing matrix inversions |
| Geophysics | Occam inversion | Preferring simpler geological models |
Ridge regression in statistics:
In statistics, Arthur Hoerl and Robert Kennard introduced "Ridge regression" in their seminal 1970 paper, addressing multicollinearity in regression. They showed that biased Ridge estimators often have lower mean squared error than unbiased OLS estimators—a breakthrough that influenced decades of statistical practice.
The name's origin:
The term "ridge" comes from the ridge added to the diagonal of $\mathbf{X}^T\mathbf{X}$. When we solve the normal equations with the L2 penalty:
$$(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
The term $\lambda \mathbf{I}$ creates a "ridge" along the diagonal, lifting the eigenvalues and stabilizing the inversion. This ridge prevents the matrix from being singular—hence the name.
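This eigenvalue lift is easy to see numerically. A sketch with a deliberately near-collinear two-column design (synthetic data, our construction): adding $\lambda \mathbf{I}$ shifts every eigenvalue of $\mathbf{X}^T\mathbf{X}$ up by exactly $\lambda$, collapsing the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Second column is an almost-exact copy of the first: near-perfect collinearity
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])

gram = X.T @ X
lam = 1.0
ridged = gram + lam * np.eye(2)

# Adding lam * I lifts every eigenvalue of X^T X by exactly lam ...
print(np.linalg.eigvalsh(gram))
print(np.linalg.eigvalsh(ridged))

# ... which dramatically improves the condition number
print(np.linalg.cond(gram), np.linalg.cond(ridged))
```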
The fact that L2 regularization appears independently across mathematics, statistics, physics, and machine learning—solving diverse problems from image reconstruction to neural network training—suggests something fundamental. The L2 penalty captures a universal principle: when data is noisy or limited, prefer simpler solutions.
Ridge regression has an elegant interpretation in Bayesian statistics: the L2 penalty corresponds to placing a Gaussian (Normal) prior on the coefficients.
The Bayesian setup:
Consider the linear model with Gaussian noise:
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
In a Bayesian framework, we treat $\boldsymbol{\beta}$ as a random variable with a prior distribution. If we place independent Gaussian priors on each coefficient:
$$\beta_j \sim \mathcal{N}(0, \tau^2) \quad \text{for } j = 1, \ldots, p$$
The prior density is:
$$p(\boldsymbol{\beta}) \propto \exp\left( -\frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 \right) = \exp\left( -\frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2 \right)$$
Maximum a Posteriori (MAP) estimation:
By Bayes' theorem, the posterior is proportional to the likelihood times the prior:
$$p(\boldsymbol{\beta} | \mathbf{y}) \propto p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
The log-posterior (ignoring constants) is:
$$\log p(\boldsymbol{\beta} | \mathbf{y}) = -\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2 + \text{const}$$
Maximizing this is equivalent to minimizing:
$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \frac{\sigma^2}{\tau^2} \|\boldsymbol{\beta}\|_2^2$$
Comparing with the Ridge objective, we identify:
$$\lambda = \frac{\sigma^2}{\tau^2}$$
Ridge regression is exactly equivalent to Maximum A Posteriori (MAP) estimation with a Gaussian prior on coefficients. The regularization parameter $\lambda$ is the ratio of noise variance to prior variance. Strong regularization ($\lambda$ large) corresponds to a tight prior ($\tau^2$ small) that strongly pulls coefficients toward zero.
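This equivalence can be checked numerically: minimizing the negative log-posterior directly should land on the closed-form Ridge solution with $\lambda = \sigma^2/\tau^2$. A sketch under assumed, illustrative values of $\sigma$ and $\tau$ (synthetic data, SciPy's general-purpose optimizer):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
sigma, tau = 1.0, 0.5  # assumed noise std and prior std (illustrative values)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=60)

def neg_log_posterior(b):
    # Negative log-posterior up to additive constants:
    # likelihood term / (2 sigma^2) + Gaussian-prior term / (2 tau^2)
    return (np.sum((y - X @ b) ** 2) / (2 * sigma**2)
            + np.sum(b ** 2) / (2 * tau**2))

beta_map = minimize(neg_log_posterior, x0=np.zeros(3)).x

# Closed-form Ridge with lambda = sigma^2 / tau^2 (normal equations)
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The MAP estimate and the Ridge solution should agree up to solver tolerance
print(np.round(beta_map, 5))
print(np.round(beta_ridge, 5))
```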
Implications of the Gaussian prior:
Centered at zero: The prior expects coefficients to be close to zero, encoding a belief that most features have small effects.
Finite variance: Unlike improper priors, the Gaussian prior is proper—it integrates to one. This ensures a proper posterior.
Smooth shrinkage: The Gaussian prior penalizes large coefficients quadratically, producing smooth shrinkage rather than hard thresholding.
Equal treatment: Independent, identical priors treat all coefficients symmetrically—no feature is a priori more important than another.
Connection to prior knowledge:
The Bayesian interpretation tells us that Ridge regression encodes a specific belief: before seeing data, we expect coefficients to be normally distributed around zero. If domain knowledge suggests otherwise (e.g., certain features should have larger effects), alternative priors might be more appropriate.
| Component | Frequentist View | Bayesian View |
|---|---|---|
| $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$ | Residual sum of squares | Negative log-likelihood |
| $\|\boldsymbol{\beta}\|_2^2$ | Penalty term | Negative log-prior |
| $\lambda$ | Regularization strength | Ratio of noise variance to prior variance ($\sigma^2/\tau^2$) |
| $\hat{\boldsymbol{\beta}}_{\text{Ridge}}$ | Penalized least squares solution | MAP estimate |
Before applying Ridge regression, feature scaling is essential. This is not merely a best practice—it's a mathematical necessity that fundamentally affects the solution.
Why scaling matters:
The L2 penalty $\|\boldsymbol{\beta}\|_2^2 = \sum_j \beta_j^2$ treats all coefficients equally. But coefficient magnitude depends on the scale of the corresponding feature: if a distance is measured in kilometers rather than meters, its coefficient must be 1,000 times larger to express the same effect, and the penalty it incurs grows by a factor of 1,000,000.
This creates an inconsistency: the solution depends on measurement units, which is scientifically unacceptable.
Applying Ridge regression to unscaled features produces mathematically valid but practically meaningless results. Features with larger scales will be under-penalized relative to their importance, leading to suboptimal shrinkage patterns. Always standardize features before Ridge regression.
Standard preprocessing:
The standard approach is to standardize each feature to have zero mean and unit variance:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
where $\bar{x}_j$ is the mean of feature $j$ and $s_j$ is its standard deviation.
After standardization:

- Every feature has mean $0$ and standard deviation $1$.
- Coefficient magnitudes become directly comparable across features.
- The L2 penalty shrinks each feature's contribution equitably, independent of measurement units.
Centering the response:
The response $\mathbf{y}$ should also be centered (mean subtracted) so that we can fit a model without an intercept. The intercept can be recovered as:
$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p} \bar{x}_j \hat{\beta}_j$$
where $\hat{\beta}_j$ are the Ridge coefficients fitted on centered, standardized data.
```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def prepare_data_for_ridge(X, y):
    """
    Prepares data for Ridge regression with proper scaling.

    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
        Feature matrix
    y : array-like of shape (n_samples,)
        Target vector

    Returns:
    --------
    X_scaled : ndarray
        Standardized features (mean=0, std=1)
    y_centered : ndarray
        Centered response
    scaler : StandardScaler
        Fitted scaler (retain for inverse transform)
    y_mean : float
        Mean of y (for intercept recovery)
    """
    # Standardize features: zero mean, unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Center response
    y_mean = np.mean(y)
    y_centered = y - y_mean

    return X_scaled, y_centered, scaler, y_mean


def recover_original_coefficients(beta_scaled, scaler, y_mean):
    """
    Converts coefficients from scaled space back to original feature space.

    Parameters:
    -----------
    beta_scaled : ndarray
        Ridge coefficients from scaled data (excluding intercept)
    scaler : StandardScaler
        The scaler used during preprocessing
    y_mean : float
        Original mean of y

    Returns:
    --------
    beta_original : ndarray
        Coefficients in original feature space
    intercept : float
        Recovered intercept term
    """
    # Scale factors: coefficients are divided by std of each feature
    beta_original = beta_scaled / scaler.scale_

    # Intercept: y_mean - sum(beta_original * x_mean)
    intercept = y_mean - np.dot(scaler.mean_, beta_original)

    return beta_original, intercept
```

We have established the theoretical foundations of the L2 penalty, understanding it from multiple complementary perspectives:

- **Optimization:** the Ridge objective adds a squared-norm penalty to the OLS loss, yielding a strictly convex problem.
- **Geometry:** the penalized and constrained forms are equivalent, and shrinkage is tangency between RSS ellipses and an L2 ball.
- **History:** the same idea appears as Tikhonov regularization, damped least squares, and weight decay across fields.
- **Bayesian statistics:** Ridge is MAP estimation under a Gaussian prior, with $\lambda = \sigma^2/\tau^2$.
What's next:
With the L2 penalty formulation established, we're ready to derive the closed-form solution for Ridge regression. We'll see that adding the regularization term makes the problem not just well-defined, but elegantly solvable—with guaranteed existence and uniqueness of the solution.
You now understand the L2 penalty from optimization, geometric, historical, and Bayesian perspectives. This multi-faceted understanding will serve you well as we develop Ridge regression into a complete, practical tool for regularized regression.