Every machine learning practitioner faces a fundamental tension: models must be complex enough to capture patterns, but not so complex that they memorize noise. This is the essence of the bias-variance tradeoff, and regularization is our primary tool for navigating it.
Regularization works by adding a penalty term to the loss function that discourages overly complex models. The most common penalties are based on norms of the model parameters—the same norms we've studied throughout this module.
By penalizing the L1 or L2 norm of weights, we constrain the model's capacity, reduce overfitting, and often improve generalization. But the choice of norm has profound implications: L1 promotes sparsity (feature selection), L2 promotes small but dense weights, and combinations offer different tradeoffs.
By the end of this page, you will understand the mathematical foundations of norm-based regularization, the geometric intuition for why L1 induces sparsity, the tradeoffs between L1 (Lasso), L2 (Ridge), and Elastic Net, practical guidelines for choosing regularization strength, and how these concepts extend to deep learning.
In supervised learning, we minimize a loss function $L(\mathbf{w}; \mathcal{D})$ that measures how well our model with parameters $\mathbf{w}$ fits the training data $\mathcal{D}$. Regularization modifies this objective by adding a penalty on the parameters:
General Form:
$$\min_{\mathbf{w}} \underbrace{L(\mathbf{w}; \mathcal{D})}_{\text{Data Fidelity}} + \underbrace{\lambda R(\mathbf{w})}_{\text{Regularization Penalty}}$$
where:
- $L(\mathbf{w}; \mathcal{D})$ is the data-fidelity term (e.g., squared error or cross-entropy),
- $R(\mathbf{w})$ is the penalty that measures model complexity,
- $\lambda \geq 0$ is the regularization strength controlling the balance between the two.
The Tradeoff:
- Small $\lambda$: the data term dominates; the fit stays close to the unregularized solution and may overfit.
- Large $\lambda$: the penalty dominates; weights shrink toward zero and the model may underfit.
Constraint Formulation (Equivalent View):
The penalized objective is mathematically equivalent to a constrained optimization:
$$\min_{\mathbf{w}} L(\mathbf{w}; \mathcal{D}) \quad \text{subject to} \quad R(\mathbf{w}) \leq t$$
For every $\lambda$, there exists a corresponding $t$ (via Lagrangian duality) where the solutions are identical. This constraint view provides geometric intuition: we're minimizing loss while staying within a 'budget' on model complexity.
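This equivalence can be checked numerically. The sketch below (synthetic data; the budget `t` is simply taken from the penalized solution's norm, an assumption for illustration) solves the ridge-penalized problem in closed form and confirms that a generic constrained solver, given the matching budget, recovers the same weights:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(60)

# Penalized form with R(w) = ||w||_2^2: closed-form ridge solution
lam = 1.0
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Constrained form: min ||y - Xw||^2 subject to ||w||_2^2 <= t,
# with the budget t set to the penalized solution's squared norm
t = w_pen @ w_pen
res = minimize(lambda w: np.sum((y - X @ w) ** 2), np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: t - w @ w}])

print(np.round(w_pen, 4))  # penalized solution
print(np.round(res.x, 4))  # constrained solution (approximately the same)
```

For any $\lambda > 0$ the constraint is active at the optimum, and the KKT multiplier of the constrained problem is exactly $\lambda$.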
Norms are natural complexity measures: larger ||w|| means weights are 'bigger' and the model is potentially more complex. L1 and L2 norms are convex, preserving the convexity of many ML problems. They have well-understood optimization properties and interpretable effects on solutions.
| Name | Penalty $R(\mathbf{w})$ | Effect |
|---|---|---|
| L2 / Ridge / Tikhonov | $\Vert\mathbf{w}\Vert_2^2 = \sum_i w_i^2$ | Shrinks all weights; stable; closed-form solution |
| L1 / Lasso | $\Vert\mathbf{w}\Vert_1 = \sum_i \lvert w_i \rvert$ | Drives some weights to exactly zero (sparsity) |
| Elastic Net | $\alpha\Vert\mathbf{w}\Vert_1 + (1-\alpha)\Vert\mathbf{w}\Vert_2^2$ | Sparsity plus grouping of correlated features |
| Weight Decay (DL) | $\tfrac{1}{2}\Vert\mathbf{W}\Vert_F^2$ | Frobenius norm on weight matrices |
L2 regularization (also called Ridge regression or Tikhonov regularization) adds the squared L2 norm of weights to the loss:
$$\min_{\mathbf{w}} \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
Closed-Form Solution:
For linear regression, the regularized normal equations have a beautiful closed form:
$$\hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
The addition of $\lambda\mathbf{I}$ to the Gram matrix has profound effects:
Invertibility: Even if $\mathbf{X}^T\mathbf{X}$ is singular (more features than samples, or multicollinearity), $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$.
Condition Number: The condition number drops from $\sigma_{\max}^2/\sigma_{\min}^2$ to $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$, improving numerical stability.
No Sparsity: The solution typically has all weights non-zero; they're shrunk but not eliminated.
Geometric Interpretation:
The unit ball of the L2 norm is a sphere. When the level curves of the loss function (elliptical for least squares) meet this spherical constraint, the contact point generically lies away from the coordinate axes: the sphere has no corners, so no coordinate is singled out to be exactly zero.
Result: Weights are shrunk proportionally toward zero, but rarely become exactly zero.
SVD Perspective:
If $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ (SVD), then:
$$\hat{\mathbf{w}}_{\text{ridge}} = \mathbf{V}\text{diag}\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right)\mathbf{U}^T\mathbf{y}$$
Each singular direction $\mathbf{v}_i$ gets shrunk by factor $\sigma_i/(\sigma_i^2 + \lambda)$. Small singular values (noisy directions) are shrunk more aggressively.
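As a quick sanity check (a small sketch on synthetic data, not part of the derivation), the SVD form can be compared against the regularized normal equations; the two should agree to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lam = 0.5

# Ridge via regularized normal equations
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Ridge via SVD: shrink each singular direction by sigma_i / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(w_direct, w_svd))  # True
```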
```python
import numpy as np
from sklearn.linear_model import Ridge

# Generate data with collinearity
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
X[:, 1] = X[:, 0] + 0.01 * np.random.randn(n)  # x1 ≈ x0 (collinear)
true_w = np.array([1, 2, 0.5, -1, 0.3])
y = X @ true_w + 0.5 * np.random.randn(n)

print("Ridge Regression Analysis")
print("=" * 50)

# Solve with different lambda values
lambdas = [0, 0.01, 0.1, 1, 10, 100]
print(f"{'Lambda':<10} {'||w||_2':<10} {'Weights...'}")
print("-" * 60)

for lam in lambdas:
    if lam == 0:
        # OLS solution
        w = np.linalg.lstsq(X, y, rcond=None)[0]
    else:
        # Ridge solution: (X'X + λI)^(-1) X'y
        XtX = X.T @ X
        I = np.eye(p)
        w = np.linalg.solve(XtX + lam * I, X.T @ y)
    norm_w = np.linalg.norm(w)
    print(f"{lam:<10} {norm_w:<10.4f} {np.round(w, 3)}")

# Demonstrate improved conditioning
print("--- Effect on Condition Number ---")
XtX = X.T @ X
print(f"κ(X'X) = {np.linalg.cond(XtX):.2f} (ill-conditioned due to collinearity)")

for lam in [0.01, 0.1, 1]:
    cond = np.linalg.cond(XtX + lam * np.eye(p))
    print(f"κ(X'X + {lam}I) = {cond:.2f}")

# Using sklearn
model = Ridge(alpha=1.0)
model.fit(X, y)
print(f"sklearn Ridge (α=1): {np.round(model.coef_, 3)}")
print(f"True weights: {true_w}")
```

L2 regularization corresponds to a Gaussian prior on weights: $p(\mathbf{w}) \propto \exp(-\lambda\|\mathbf{w}\|_2^2)$. Ridge regression is equivalent to MAP estimation with this prior. The regularization strength λ is inversely proportional to the prior variance.
L1 regularization (also called Lasso - Least Absolute Shrinkage and Selection Operator) adds the L1 norm of weights to the loss:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$$
The defining property of Lasso is sparsity: it tends to set some coefficients to exactly zero, effectively performing feature selection.
Why L1 Induces Sparsity (Geometric Intuition):
This is one of the most important insights in regularization theory. In the constraint view, the L1 ball $\{\mathbf{w} : \|\mathbf{w}\|_1 \leq t\}$ is a diamond (cross-polytope) whose corners sit on the coordinate axes. The elliptical level curves of the loss expand until they first touch this region, and that first contact typically occurs at a corner or edge, where one or more coordinates are exactly zero. In high dimensions the ball has many such low-dimensional corners and edges, making sparse solutions even more likely.
Contrast with L2: the L2 ball is a sphere with no corners, so the contact point generically has all coordinates non-zero. Weights are shrunk, not eliminated.
Soft Thresholding:
For simple settings (orthonormal design), Lasso has a closed-form solution involving the soft thresholding operator:
$$\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}}) = \text{sign}(\hat{w}_j^{\text{OLS}}) \cdot \max(|\hat{w}_j^{\text{OLS}}| - \lambda, 0)$$
This shrinks coefficients toward zero and sets them exactly to zero if they're smaller than $\lambda$ in absolute value.
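The operator itself is a one-liner. A minimal sketch (the helper name `soft_threshold` is ours, not a standard API) showing the shrink-and-clip behavior:

```python
import numpy as np

def soft_threshold(w, lam):
    """S_lambda(w): shrink each entry toward zero by lam, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w_ols = np.array([2.5, -0.3, 0.8, -1.7, 0.05])
print(soft_threshold(w_ols, 0.5))
# Entries with |w| <= 0.5 become exactly 0; the rest move 0.5 toward zero.
```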
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Generate sparse ground truth
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)

# Only 5 features are truly relevant
true_w = np.zeros(p)
true_w[:5] = [3, -2, 1.5, -1, 0.5]  # First 5 features matter

y = X @ true_w + 0.5 * np.random.randn(n)

print("Lasso vs Ridge: Sparsity Comparison")
print("=" * 60)
print(f"True weights (first 10): {true_w[:10]}")
print(f"Number of true non-zero weights: {np.sum(true_w != 0)}")

# Lasso (L1)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Ridge (L2)
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

print(f"{'Method':<10} {'Non-zeros':<12} {'||w||_1':<10} {'||w||_2':<10}")
print("-" * 55)
print(f"{'Lasso':<10} {np.sum(np.abs(lasso.coef_) > 1e-6):<12} "
      f"{np.linalg.norm(lasso.coef_, 1):<10.4f} "
      f"{np.linalg.norm(lasso.coef_, 2):<10.4f}")
print(f"{'Ridge':<10} {np.sum(np.abs(ridge.coef_) > 1e-6):<12} "
      f"{np.linalg.norm(ridge.coef_, 1):<10.4f} "
      f"{np.linalg.norm(ridge.coef_, 2):<10.4f}")

print(f"Lasso coefficients (first 10): {np.round(lasso.coef_[:10], 3)}")
print(f"Ridge coefficients (first 10): {np.round(ridge.coef_[:10], 3)}")

# Regularization path
print("--- Lasso Regularization Path ---")
alphas = [0.01, 0.05, 0.1, 0.5, 1.0]
print(f"{'Alpha':<10} {'Non-zeros':<12} {'Selected features'}")
for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    nonzeros = np.where(np.abs(lasso.coef_) > 1e-6)[0]
    print(f"{alpha:<10} {len(nonzeros):<12} {list(nonzeros)}")
```

L1 regularization corresponds to a Laplace (double exponential) prior on weights: $p(w) \propto \exp(-\lambda|w|)$. The Laplace distribution has heavier tails than a Gaussian, allowing large weights, but a sharp peak at zero, encouraging sparsity. Lasso is MAP estimation with this prior.
Elastic Net combines L1 and L2 regularization, addressing weaknesses of both:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\left(\alpha\|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2\right)$$
Alternatively written as:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$$
where $\alpha \in [0, 1]$ controls the mix:
- $\alpha = 1$: pure Lasso (L1 only)
- $\alpha = 0$: pure Ridge (L2 only)
- intermediate values trade sparsity against the stability and grouping behavior of L2.
Geometric Intuition:
The Elastic Net constraint region is between the L1 diamond and L2 sphere—a 'rounded diamond.' It has the corners of L1 (enabling sparsity) but with smoothed edges from L2 (providing stability).
```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Generate data with correlated features
np.random.seed(42)
n, p = 100, 10
base = np.random.randn(n)
X = np.zeros((n, p))

# Create groups of correlated features
X[:, 0] = base + 0.1 * np.random.randn(n)
X[:, 1] = base + 0.1 * np.random.randn(n)  # Correlated with X[:, 0]
X[:, 2] = base + 0.1 * np.random.randn(n)  # Correlated with X[:, 0], X[:, 1]
X[:, 3:] = np.random.randn(n, p - 3)  # Independent features

# True weights: all correlated features matter equally
true_w = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + 0.5 * np.random.randn(n)

print("Elastic Net: Handling Correlated Features")
print("=" * 60)
print("Features 0, 1, 2 are correlated (all have true weight 1)")
print(f"True weights: {true_w}")

# Compare methods
alpha = 0.1

lasso = Lasso(alpha=alpha)
ridge = Ridge(alpha=alpha)
elastic = ElasticNet(alpha=alpha, l1_ratio=0.5)

lasso.fit(X, y)
ridge.fit(X, y)
elastic.fit(X, y)

print(f"{'Method':<15} {'w[0]':<8} {'w[1]':<8} {'w[2]':<8} {'Non-zeros':<10}")
print("-" * 60)
print(f"{'Lasso':<15} {lasso.coef_[0]:<8.3f} {lasso.coef_[1]:<8.3f} "
      f"{lasso.coef_[2]:<8.3f} {np.sum(np.abs(lasso.coef_) > 0.01)}")
print(f"{'Ridge':<15} {ridge.coef_[0]:<8.3f} {ridge.coef_[1]:<8.3f} "
      f"{ridge.coef_[2]:<8.3f} {np.sum(np.abs(ridge.coef_) > 0.01)}")
print(f"{'Elastic Net':<15} {elastic.coef_[0]:<8.3f} {elastic.coef_[1]:<8.3f} "
      f"{elastic.coef_[2]:<8.3f} {np.sum(np.abs(elastic.coef_) > 0.01)}")

print("Notice: Lasso may arbitrarily pick one of the correlated features,")
print("while Ridge and Elastic Net spread weight across all three.")

# Vary l1_ratio
print("--- Effect of l1_ratio (α) ---")
for l1_ratio in [0.1, 0.3, 0.5, 0.7, 0.9]:
    en = ElasticNet(alpha=0.1, l1_ratio=l1_ratio)
    en.fit(X, y)
    nonzeros = np.sum(np.abs(en.coef_) > 0.01)
    print(f"l1_ratio={l1_ratio}: {nonzeros} non-zeros, coef[:3] = {np.round(en.coef_[:3], 3)}")
```

The regularization strength $\lambda$ (or $\alpha$ in scikit-learn) is a critical hyperparameter. Too small, and regularization has no effect; too large, and the model underfits.
Cross-Validation:
The standard approach is to select $\lambda$ via cross-validation: fit the model over a logarithmic grid of candidate $\lambda$ values, estimate out-of-sample error for each with K-fold cross-validation, and pick $\lambda$ according to one of the strategies below.
Common Strategies:
| Strategy | Description | When to Use |
|---|---|---|
| $\lambda_{\min}$ | λ with lowest CV error | Maximum predictive accuracy |
| $\lambda_{1se}$ | Largest λ within 1 SE of minimum | Simpler model, similar performance; prefer for interpretability |
| AIC/BIC | Information criteria (penalize complexity) | Model selection theory; Bayesian justification |
| Stability selection | Features selected across multiple subsamples | Robust feature selection |
```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

# Generate data
np.random.seed(42)
n, p = 200, 50
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:10] = np.random.randn(10)
y = X @ true_w + 0.5 * np.random.randn(n)

print("Automatic Lambda Selection with Cross-Validation")
print("=" * 60)

# LassoCV automatically finds optimal lambda
lasso_cv = LassoCV(cv=5, alphas=np.logspace(-4, 1, 50), random_state=42)
lasso_cv.fit(X, y)

print("Lasso with CV:")
print(f"  Best α: {lasso_cv.alpha_:.6f}")
print(f"  Non-zero coefficients: {np.sum(np.abs(lasso_cv.coef_) > 1e-6)}")
print(f"  True non-zero: {np.sum(true_w != 0)}")

# Ridge CV
ridge_cv = RidgeCV(cv=5, alphas=np.logspace(-4, 4, 50))
ridge_cv.fit(X, y)
print("Ridge with CV:")
print(f"  Best α: {ridge_cv.alpha_:.6f}")

# Elastic Net CV
enet_cv = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95],
                       alphas=np.logspace(-4, 1, 50), random_state=42)
enet_cv.fit(X, y)
print("Elastic Net with CV:")
print(f"  Best α: {enet_cv.alpha_:.6f}")
print(f"  Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"  Non-zero coefficients: {np.sum(np.abs(enet_cv.coef_) > 1e-6)}")

# Regularization path
print("--- Regularization Path (Lasso) ---")
print(f"{'Alpha':<12} {'Non-zeros':<12} {'CV MSE (mean)':<15}")
print("-" * 45)
for alpha, mse_path in zip(lasso_cv.alphas_[::10], lasso_cv.mse_path_[::10]):
    mean_mse = np.mean(mse_path)
    n_nonzero = np.sum(np.abs(lasso_cv.coef_) > 1e-6) if alpha == lasso_cv.alpha_ else "—"
    print(f"{alpha:<12.6f} {str(n_nonzero):<12} {mean_mse:<15.4f}")
```

Start with a wide logarithmic range for λ (e.g., 10⁻⁴ to 10²). For Elastic Net, also search over l1_ratio (e.g., 0.1, 0.5, 0.7, 0.9, 0.99). Use at least 5-fold CV. Prefer λ_1se over λ_min if interpretability/simplicity matters. Always standardize features before regularization so that λ penalizes all features comparably.
Norm-based regularization extends to deep learning, though the landscape is more complex due to the non-convex loss surfaces and the unique properties of neural networks.
Weight Decay (L2):
In deep learning, L2 regularization is typically implemented as weight decay—adding a fraction of the weights to the gradient update:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta(\nabla L + \lambda \mathbf{w}_t) = (1 - \eta\lambda)\mathbf{w}_t - \eta\nabla L$$
The $(1 - \eta\lambda)$ factor 'decays' weights toward zero each step.
Note: With adaptive optimizers like Adam, weight decay and L2 regularization are NOT equivalent! Use decoupled weight decay (AdamW) for the intended behavior.
```python
import torch
import torch.nn as nn

# Example: Adding weight decay in PyTorch

# Method 1: In optimizer (applies to all parameters)
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

optimizer_with_wd = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    weight_decay=0.01  # L2 regularization strength
)

# Method 2: AdamW for proper weight decay (decoupled from adaptive learning rate)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Proper decoupled weight decay
)

# Method 3: Different decay for different parameter groups
optimizer_custom = torch.optim.AdamW([
    {'params': model[0].parameters(), 'weight_decay': 0.01},   # First layer
    {'params': model[2].parameters(), 'weight_decay': 0.001}   # Second layer (less decay)
], lr=0.001)

# Manual L2 regularization (equivalent to weight_decay in SGD)
def train_step_manual_l2(model, optimizer, x, y, l2_lambda=0.01):
    optimizer.zero_grad()
    output = model(x)
    loss = nn.functional.cross_entropy(output, y)
    # Add L2 penalty manually
    l2_reg = sum(param.pow(2).sum() for param in model.parameters())
    loss = loss + l2_lambda * l2_reg
    loss.backward()
    optimizer.step()
    return loss.item()

print("Weight decay in PyTorch:")
print("- Use weight_decay parameter in optimizer")
print("- Prefer AdamW over Adam + L2 for proper decoupled weight decay")
print("- Can specify different decay for different parameter groups")
print("- Typically weight_decay ~ 0.01 to 0.1 for transformers")
```

In the Adam optimizer, weight decay and L2 regularization are NOT equivalent due to adaptive learning rates: the L2 penalty's gradient is scaled by Adam's per-parameter step size, diluting its effect. AdamW fixes this by applying weight decay directly to the weights, not through the gradient. Always prefer AdamW for consistent regularization.
After understanding the theory, here are actionable guidelines for applying regularization in practice.
| Method | Parameter | Typical Range | Notes |
|---|---|---|---|
| Ridge | λ (alpha) | 10⁻⁴ to 10⁴ | Wide range; scale depends on data |
| Lasso | λ (alpha) | 10⁻⁴ to 10¹ | Larger values → more zeros |
| Elastic Net | λ (alpha) | 10⁻⁴ to 10¹ | Combined overall strength |
| Elastic Net | l1_ratio | 0.1 to 0.99 | Higher → more sparsity |
| Deep Learning | weight_decay | 10⁻⁶ to 10⁻¹ | 0.01-0.1 common for transformers |
Think of regularization as expressing a prior belief: 'All else equal, I prefer simpler models with smaller/sparser weights.' The regularization strength λ quantifies how strongly you hold this belief. Cross-validation lets the data tell you the appropriate strength.
Module Complete:
This concludes our module on Norms and Distance Metrics. We've journeyed from the abstract axioms of norms through their concrete applications in measuring vectors, matrices, distances, similarities, and finally controlling model complexity through regularization.
These concepts form a mathematical foundation that appears throughout machine learning—in loss functions, in optimization algorithms, in generalization theory, and in practical model tuning. Mastery of these ideas equips you to understand why algorithms work and how to make them work better.
You've mastered the theory and practice of norms, distances, similarities, and regularization. These tools for measuring 'size' and 'difference' are fundamental to virtually every ML algorithm—from linear regression to deep neural networks to clustering to retrieval systems. You're now equipped to choose appropriate metrics for your problems and apply regularization effectively.