In the previous module, we confronted the harsh reality of overfitting—the phenomenon where our models fit the training data too well, capturing noise rather than signal, and consequently failing catastrophically on unseen data. We saw how high-dimensional settings amplify this problem, leading to exploding variance and unreliable predictions.
The question now becomes: how do we tame this variance explosion?
The answer lies in regularization, a family of techniques that constrain our model's flexibility to achieve better generalization. Among regularization methods, Ridge regression (L2 regularization) stands as one of the most elegant, theoretically well-understood, and practically effective approaches. It represents the foundation upon which more sophisticated regularization techniques are built.
By the end of this page, you will understand the L2 penalty from multiple perspectives: the optimization objective, the geometric intuition, the probabilistic interpretation, and the historical context. You will be able to write the Ridge regression objective function and explain exactly why it works.
Before diving into the L2 penalty, let's crystallize exactly what problem we're solving. Recall the Ordinary Least Squares (OLS) objective:
$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$$
This minimizes the sum of squared residuals, finding the coefficient vector $\boldsymbol{\beta}$ that makes our predictions $\mathbf{X}\boldsymbol{\beta}$ as close as possible to the observed targets $\mathbf{y}$.
When does OLS fail? OLS becomes problematic in several scenarios:

- **Multicollinearity:** highly correlated predictors make $\mathbf{X}^T\mathbf{X}$ nearly singular, so coefficient estimates become unstable and their variance explodes.
- **High dimensionality:** when the number of features $p$ approaches or exceeds the number of observations $n$, the solution is poorly determined, or not unique at all.
- **Flexible feature expansions:** rich bases such as high-degree polynomials let the model chase noise, producing wildly oscillating fits.
A concrete example:
Consider fitting a polynomial of degree 15 to 20 data points. OLS will find coefficients that pass exactly through all points (or very close to them). But these coefficients might be enormous—values like $\beta_7 = 1,234,567$ and $\beta_8 = -1,234,589$ that largely cancel each other out.
This cancellation is a hallmark of overfitting: the model exploits tiny numerical differences rather than learning genuine patterns. When we encounter a new data point, these huge coefficients amplify any small deviation, producing wildly inaccurate predictions.
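To see this cancellation and its cure concretely, here is a small sketch (the synthetic sine data, the use of scikit-learn, and the specific degree and penalty strength are our illustrative assumptions, not from the text) comparing a degree-15 polynomial fit by OLS and by Ridge on the same 20 points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))                      # 20 data points
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=20)  # noisy target

# Expand to degree-15 polynomial features, then standardize them
features = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                         StandardScaler())
Z = features.fit_transform(X)

ols = LinearRegression().fit(Z, y)    # unregularized fit
ridge = Ridge(alpha=1.0).fit(Z, y)    # L2-penalized fit

# OLS coefficients on this near-collinear basis are typically enormous;
# Ridge keeps them modest
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

Whatever the exact numbers on your machine, the Ridge coefficient norm is provably no larger than the OLS norm for any positive penalty.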
The insight: Large coefficients are a symptom of overfitting. If we can discourage large coefficients, we can reduce overfitting. This is precisely what the L2 penalty accomplishes.
Regularization imposes a "complexity budget" on the model. By penalizing large coefficients, we force the model to find simpler solutions that rely on genuine patterns rather than noise. The L2 penalty specifically discourages large coefficient magnitudes while still allowing for necessary flexibility.
Ridge regression modifies the OLS objective by adding a penalty term that discourages large coefficient values. The L2 penalty (also called the squared Euclidean norm or Tikhonov regularization) measures the sum of squared coefficients:
$$\text{L2 penalty} = \|\boldsymbol{\beta}\|_2^2 = \sum_{j=1}^{p} \beta_j^2$$
The complete Ridge regression objective combines the OLS loss with this penalty:
$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$
Let's dissect each component of this formulation:
| Component | Mathematical Expression | Role |
|---|---|---|
| Loss term | $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$ | Measures fit to training data (sum of squared residuals) |
| Penalty term | $\|\boldsymbol{\beta}\|_2^2$ | Measures model complexity (sum of squared coefficients) |
| Regularization parameter | $\lambda \geq 0$ | Controls tradeoff between fit and complexity |
| Coefficients | $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ | Parameters to be estimated |
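The objective in the table above translates directly into a few lines of NumPy. This is a sketch for intuition (the function name `ridge_objective` is ours), not a fitting routine:

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Evaluate J(beta) = ||y - X beta||_2^2 + lam * ||beta||_2^2."""
    residuals = y - X @ beta
    loss = np.sum(residuals ** 2)       # squared-error loss term
    penalty = lam * np.sum(beta ** 2)   # L2 penalty term
    return loss + penalty

# Tiny hand-checkable example on a 2x2 identity design:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 3.0])
print(ridge_objective(np.zeros(2), X, y, lam=1.0))           # 13.0: all loss, no penalty
print(ridge_objective(np.array([2.0, 3.0]), X, y, lam=1.0))  # 13.0: no loss, all penalty
```

The two evaluations illustrate the tradeoff: at $\boldsymbol{\beta} = \mathbf{0}$ the penalty is zero but the loss is $2^2 + 3^2 = 13$; at the perfect-fit coefficients the loss is zero but the penalty is $13$.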
The regularization parameter $\lambda$:
The parameter $\lambda$ (often denoted $\alpha$ in software libraries) is the regularization strength, a hyperparameter. It controls how much we penalize large coefficients:

- $\lambda = 0$: the penalty vanishes and we recover the OLS solution.
- Small $\lambda$: mild shrinkage; the solution stays close to OLS.
- Large $\lambda$: strong shrinkage; coefficients are pulled heavily toward zero.
- $\lambda \to \infty$: the penalty dominates and $\boldsymbol{\beta} \to \mathbf{0}$.
The challenge of choosing the right $\lambda$ is a central topic we'll address later in this module.
By convention, the intercept $\beta_0$ is typically not penalized. This is because penalizing the intercept would pull predictions toward zero rather than toward the data mean. In practice, we center both the features and the response (subtract their means) so the intercept becomes zero, then apply Ridge to the remaining coefficients.
Expanded form:
Writing out the objective explicitly:
$$J(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
This is a quadratic function in $\boldsymbol{\beta}$—it's bowl-shaped with a unique minimum (as long as the function is strictly convex, which we'll verify). This mathematical property is crucial: it guarantees that Ridge regression has a unique, closed-form solution.
The Ridge objective can be expressed in two equivalent ways, connected by Lagrangian duality. Understanding both perspectives deepens our geometric intuition.
Penalized (Lagrangian) form:
$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$
Constrained form:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_2^2 \leq t$$
These two formulations are equivalent: for every $\lambda > 0$, there exists a $t > 0$ such that the solutions coincide, and vice versa. The relationship between $\lambda$ and $t$ is determined by the KKT conditions.
The constrained form has a beautiful geometric interpretation: we're finding the point in the ball $\|\boldsymbol{\beta}\|_2^2 \leq t$ that minimizes the squared error. As $t$ shrinks, we restrict ourselves to smaller balls, forcing coefficients toward zero. The constraint $t$ acts as a "budget" for coefficient magnitudes.
Why the L2 norm specifically?
The choice of the L2 (Euclidean) norm is not arbitrary. Consider properties we desire from a regularization penalty:

- **Convexity:** combined with the convex squared-error loss, the penalty should preserve a unique global minimum.
- **Smoothness:** a penalty that is differentiable everywhere keeps the optimization tractable, in closed form or by gradient methods.
- **Symmetry:** the penalty should treat all coefficient directions alike, favoring no feature a priori.
Alternative norms (L1, L∞, etc.) have different properties, leading to different regularization behaviors, which we'll explore in subsequent modules.
| Aspect | Penalized Form | Constrained Form |
|---|---|---|
| Objective | Minimize loss + penalty | Minimize loss only |
| Constraint | None (unconstrained optimization) | $\|\boldsymbol{\beta}\|_2^2 \leq t$ |
| Hyperparameter | $\lambda$ (penalty strength) | $t$ (constraint radius) |
| Relationship | $\lambda$ is Lagrange multiplier | $t$ determines feasible region |
| When $\lambda = 0$ | OLS solution | No constraint ($t = \infty$) |
| When $\lambda \to \infty$ | $\boldsymbol{\beta} \to \mathbf{0}$ | $t \to 0$ |
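The equivalence in the table above can be verified numerically. A hedged sketch (synthetic data; SciPy's general-purpose SLSQP solver): solve the penalized form in closed form via the normal equations, set the budget $t$ to the resulting squared norm, and check that the constrained problem lands on the same point:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Penalized form: solve the normal equations (X^T X + lam I) beta = X^T y
lam = 5.0
beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Constrained form: minimize RSS subject to ||beta||^2 <= t,
# with the budget t set to the penalized solution's squared norm
t = np.sum(beta_pen ** 2)

def rss(b):
    return np.sum((y - X @ b) ** 2)

ball = {"type": "ineq", "fun": lambda b: t - np.sum(b ** 2)}
beta_con = minimize(rss, x0=np.zeros(3), constraints=[ball]).x

# The two solutions should coincide up to solver tolerance
print(np.round(beta_pen, 4))
print(np.round(beta_con, 4))
```

The constraint is active at the optimum (the solution sits on the boundary $\|\boldsymbol{\beta}\|_2^2 = t$), which is exactly the KKT relationship between $\lambda$ and $t$ mentioned above.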
To truly understand Ridge regression, we must visualize what's happening in the coefficient space. Consider a simple case with two coefficients, $\beta_1$ and $\beta_2$.
The L2 constraint region:
The constraint $\|\boldsymbol{\beta}\|_2^2 \leq t$ defines a ball (or disk in 2D) centered at the origin:
$$\beta_1^2 + \beta_2^2 \leq t$$
This is a circular region in 2D, a sphere in 3D, and a hypersphere in higher dimensions. The radius of this ball is $\sqrt{t}$.
The RSS contours:
The residual sum of squares (RSS) without regularization defines elliptical contours in coefficient space. The OLS solution $\hat{\boldsymbol{\beta}}_{\text{OLS}}$ is the center of these ellipses—the point where RSS is minimized.
Finding the Ridge solution:
The Ridge solution is found where the RSS contour ellipse is tangent to the L2 constraint ball. This tangent point has several important properties:

- It is the point of lowest RSS among all points in the ball.
- Whenever the OLS solution lies outside the ball, the constraint is active: the solution sits exactly on the boundary $\|\boldsymbol{\beta}\|_2^2 = t$.
- At the tangent point, the gradient of the RSS is parallel to the gradient of the constraint, which is precisely the KKT stationarity condition linking $\lambda$ and $t$.
Geometric insight:
As we shrink the ball (decrease $t$ or increase $\lambda$), the tangent point moves toward the origin. This is the shrinkage effect—coefficients are pulled toward zero, reducing their magnitudes while still fitting the data as well as possible given the constraint.
The shrinkage interpretation reveals why Ridge regression combats overfitting. Large coefficients arise when the model exploits noise in the training data. By constraining coefficient magnitudes, we prevent this exploitation. The model must find patterns that work with smaller, more stable coefficients—patterns that tend to generalize better.
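The shrinkage effect is easy to observe empirically. A small sketch (synthetic data; scikit-learn's `Ridge`, whose regularization parameter is named `alpha`): as the penalty grows, the coefficient norm shrinks monotonically toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(size=100)

# Fit Ridge across increasingly strong regularization
norms = []
for alpha in [0.01, 1.0, 100.0, 10_000.0]:
    coef = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))

# Coefficient norm decreases as alpha grows: the tangent point
# moves toward the origin
print(norms)
```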
The smoothness of the L2 ball:
Unlike the L1 ball (which has corners—we'll see this with Lasso), the L2 ball is perfectly smooth. This smoothness has important implications:

- The tangent point can land anywhere on the boundary, so coefficients are shrunk toward zero but rarely hit exactly zero.
- The solution varies smoothly and continuously as $\lambda$ (or $t$) changes.
- The objective is differentiable everywhere, which is what makes a closed-form solution possible.
This is both a strength and weakness: Ridge is stable and well-behaved, but it doesn't perform automatic feature selection (all features remain in the model).
The L2 penalty we use in Ridge regression has a rich history extending beyond statistics into the theory of ill-posed problems in mathematics and physics.
Origins in inverse problems:
In the 1940s-1960s, mathematicians encountered "ill-posed" problems—equations that either have no solution, multiple solutions, or solutions that are extremely sensitive to input perturbations. These arise frequently in physics (heat conduction, wave propagation) and engineering (signal reconstruction, image processing).
Andrey Tikhonov (Russian mathematician, 1906-1993) developed a systematic approach called Tikhonov regularization in 1963. The core idea: when solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ where $\mathbf{A}$ is ill-conditioned, instead solve:
$$\min_{\mathbf{x}} \left\{ \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_2^2 \right\}$$
This is precisely the Ridge regression objective.
| Field | Name | Context |
|---|---|---|
| Statistics | Ridge regression | Regression with correlated predictors |
| Mathematics | Tikhonov regularization | Solving ill-posed inverse problems |
| Machine Learning | L2 regularization | Neural network weight decay |
| Numerical Analysis | Damped least squares | Stabilizing matrix inversions |
| Geophysics | Occam inversion | Preferring simpler geological models |
Ridge regression in statistics:
In statistics, Arthur Hoerl and Robert Kennard introduced "Ridge regression" in their seminal 1970 paper, addressing multicollinearity in regression. They showed that biased Ridge estimators often have lower mean squared error than unbiased OLS estimators—a breakthrough that influenced decades of statistical practice.
The name's origin:
The term "ridge" comes from the ridge added to the diagonal of $\mathbf{X}^T\mathbf{X}$. When we solve the normal equations with the L2 penalty:
$$(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
The term $\lambda \mathbf{I}$ creates a "ridge" along the diagonal, lifting the eigenvalues and stabilizing the inversion. This ridge prevents the matrix from being singular—hence the name.
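This eigenvalue lift is easy to see numerically. A sketch with a deliberately near-collinear two-column design (synthetic data, our construction): adding $\lambda \mathbf{I}$ shifts every eigenvalue of $\mathbf{X}^T\mathbf{X}$ up by exactly $\lambda$, collapsing the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Second column is an almost-exact copy of the first: near-perfect collinearity
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])

gram = X.T @ X
lam = 1.0
ridged = gram + lam * np.eye(2)

# Adding lam * I lifts every eigenvalue of X^T X by exactly lam ...
print(np.linalg.eigvalsh(gram))
print(np.linalg.eigvalsh(ridged))

# ... which dramatically improves the condition number
print(np.linalg.cond(gram), np.linalg.cond(ridged))
```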
The fact that L2 regularization appears independently across mathematics, statistics, physics, and machine learning—solving diverse problems from image reconstruction to neural network training—suggests something fundamental. The L2 penalty captures a universal principle: when data is noisy or limited, prefer simpler solutions.
Ridge regression has an elegant interpretation in Bayesian statistics: the L2 penalty corresponds to placing a Gaussian (Normal) prior on the coefficients.
The Bayesian setup:
Consider the linear model with Gaussian noise:
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
In a Bayesian framework, we treat $\boldsymbol{\beta}$ as a random variable with a prior distribution. If we place independent Gaussian priors on each coefficient:
$$\beta_j \sim \mathcal{N}(0, \tau^2) \quad \text{for } j = 1, \ldots, p$$
The prior density is:
$$p(\boldsymbol{\beta}) \propto \exp\left( -\frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 \right) = \exp\left( -\frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2 \right)$$
Maximum a Posteriori (MAP) estimation:
By Bayes' theorem, the posterior is proportional to the likelihood times the prior:
$$p(\boldsymbol{\beta} | \mathbf{y}) \propto p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
The log-posterior (ignoring constants) is:
$$\log p(\boldsymbol{\beta} | \mathbf{y}) = -\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2 + \text{const}$$
Maximizing this is equivalent to minimizing:
$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \frac{\sigma^2}{\tau^2} \|\boldsymbol{\beta}\|_2^2$$
Comparing with the Ridge objective, we identify:
$$\lambda = \frac{\sigma^2}{\tau^2}$$
Ridge regression is exactly equivalent to Maximum A Posteriori (MAP) estimation with a Gaussian prior on coefficients. The regularization parameter $\lambda$ is the ratio of noise variance to prior variance. Strong regularization ($\lambda$ large) corresponds to a tight prior ($\tau^2$ small) that strongly pulls coefficients toward zero.
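This equivalence can be checked numerically: minimizing the negative log-posterior directly should land on the closed-form Ridge solution with $\lambda = \sigma^2/\tau^2$. A sketch under assumed, illustrative values of $\sigma$ and $\tau$ (synthetic data, SciPy's general-purpose optimizer):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
sigma, tau = 1.0, 0.5  # assumed noise std and prior std (illustrative values)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=60)

def neg_log_posterior(b):
    # Negative log-posterior up to additive constants:
    # likelihood term / (2 sigma^2) + Gaussian-prior term / (2 tau^2)
    return (np.sum((y - X @ b) ** 2) / (2 * sigma**2)
            + np.sum(b ** 2) / (2 * tau**2))

beta_map = minimize(neg_log_posterior, x0=np.zeros(3)).x

# Closed-form Ridge with lambda = sigma^2 / tau^2 (normal equations)
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The MAP estimate and the Ridge solution should agree up to solver tolerance
print(np.round(beta_map, 5))
print(np.round(beta_ridge, 5))
```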
Implications of the Gaussian prior:
Centered at zero: The prior expects coefficients to be close to zero, encoding a belief that most features have small effects.
Finite variance: Unlike improper priors, the Gaussian prior is proper—it integrates to one. This ensures a proper posterior.
Smooth shrinkage: The Gaussian prior penalizes large coefficients quadratically, producing smooth shrinkage rather than hard thresholding.
Equal treatment: Independent, identical priors treat all coefficients symmetrically—no feature is a priori more important than another.
Connection to prior knowledge:
The Bayesian interpretation tells us that Ridge regression encodes a specific belief: before seeing data, we expect coefficients to be normally distributed around zero. If domain knowledge suggests otherwise (e.g., certain features should have larger effects), alternative priors might be more appropriate.
| Component | Frequentist View | Bayesian View |
|---|---|---|
| $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$ | Residual sum of squares | Negative log-likelihood |
| $\|\boldsymbol{\beta}\|_2^2$ | Penalty term | Negative log-prior |
| $\lambda$ | Regularization strength | Ratio of noise variance to prior variance ($\sigma^2/\tau^2$) |
| $\hat{\boldsymbol{\beta}}_{\text{Ridge}}$ | Penalized least squares solution | MAP estimate |
Before applying Ridge regression, feature scaling is essential. This is not merely a best practice—it's a mathematical necessity that fundamentally affects the solution.
Why scaling matters:
The L2 penalty $\|\boldsymbol{\beta}\|_2^2 = \sum_j \beta_j^2$ treats all coefficients equally. But coefficient magnitude depends on the scale of the corresponding feature: if a distance is measured in kilometers rather than meters, its coefficient must be 1,000 times larger to express the same effect, and the penalty it incurs grows by a factor of 1,000,000.
This creates an inconsistency: the solution depends on measurement units, which is scientifically unacceptable.
Applying Ridge regression to unscaled features produces mathematically valid but practically meaningless results. Features with larger scales will be under-penalized relative to their importance, leading to suboptimal shrinkage patterns. Always standardize features before Ridge regression.
Standard preprocessing:
The standard approach is to standardize each feature to have zero mean and unit variance:
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
where $\bar{x}_j$ is the mean of feature $j$ and $s_j$ is its standard deviation.
After standardization:

- Every feature has mean $0$ and standard deviation $1$.
- Coefficient magnitudes become directly comparable across features.
- The L2 penalty shrinks each feature's contribution equitably, independent of measurement units.
Centering the response:
The response $\mathbf{y}$ should also be centered (mean subtracted) so that we can fit a model without an intercept. The intercept can be recovered as:
$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p} \bar{x}_j \hat{\beta}_j$$
where $\hat{\beta}_j$ are the Ridge coefficients fitted on centered, standardized data.
```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def prepare_data_for_ridge(X, y):
    """
    Prepares data for Ridge regression with proper scaling.

    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
        Feature matrix
    y : array-like of shape (n_samples,)
        Target vector

    Returns:
    --------
    X_scaled : ndarray
        Standardized features (mean=0, std=1)
    y_centered : ndarray
        Centered response
    scaler : StandardScaler
        Fitted scaler (retain for inverse transform)
    y_mean : float
        Mean of y (for intercept recovery)
    """
    # Standardize features: zero mean, unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Center response
    y_mean = np.mean(y)
    y_centered = y - y_mean

    return X_scaled, y_centered, scaler, y_mean


def recover_original_coefficients(beta_scaled, scaler, y_mean):
    """
    Converts coefficients from scaled space back to original feature space.

    Parameters:
    -----------
    beta_scaled : ndarray
        Ridge coefficients from scaled data (excluding intercept)
    scaler : StandardScaler
        The scaler used during preprocessing
    y_mean : float
        Original mean of y

    Returns:
    --------
    beta_original : ndarray
        Coefficients in original feature space
    intercept : float
        Recovered intercept term
    """
    # Scale factors: coefficients are divided by std of each feature
    beta_original = beta_scaled / scaler.scale_

    # Intercept: y_mean - sum(beta_original * x_mean)
    intercept = y_mean - np.dot(scaler.mean_, beta_original)

    return beta_original, intercept
```

We have established the theoretical foundations of the L2 penalty, understanding it from multiple complementary perspectives:

- **Optimization:** the Ridge objective adds a squared-norm penalty to the OLS loss, yielding a strictly convex problem.
- **Geometry:** the penalized and constrained forms are equivalent, and shrinkage is tangency between RSS ellipses and an L2 ball.
- **History:** the same idea appears as Tikhonov regularization, damped least squares, and weight decay across fields.
- **Bayesian statistics:** Ridge is MAP estimation under a Gaussian prior, with $\lambda = \sigma^2/\tau^2$.
What's next:
With the L2 penalty formulation established, we're ready to derive the closed-form solution for Ridge regression. We'll see that adding the regularization term makes the problem not just well-defined, but elegantly solvable—with guaranteed existence and uniqueness of the solution.
You now understand the L2 penalty from optimization, geometric, historical, and Bayesian perspectives. This multi-faceted understanding will serve you well as we develop Ridge regression into a complete, practical tool for regularized regression.