When we regularize a learning problem, we are not merely adding a penalty term to our objective function—we are constraining the space of solutions to those with bounded complexity. This constrained optimization perspective reveals the geometric structure underlying regularization and provides powerful insights into why different regularizers have different effects.
In this page, we develop the optimization-theoretic view of regularization, showing how penalty formulations and constraint formulations are two sides of the same coin, connected through the beautiful mathematics of Lagrangian duality.
By the end of this page, you will understand: (1) the equivalence between penalized and constrained optimization formulations, (2) how Lagrangian duality connects regularization strength to constraint budgets, (3) the geometric interpretation of regularization in parameter space, (4) how constraint geometry affects solution properties, and (5) practical implications for algorithm design and hyperparameter tuning.
Regularization can be expressed in two mathematically related forms: the penalized (Lagrangian) form and the constrained (Ivanov) form.
The penalized form adds a regularization term to the loss:
$$\hat{w}_{\lambda} = \arg\min_{w} \left[ \mathcal{L}(w; S) + \lambda \Omega(w) \right]$$
where $\mathcal{L}(w; S)$ is the empirical loss on the training sample $S$, $\Omega(w)$ is the regularizer measuring model complexity, and $\lambda \geq 0$ controls the strength of regularization.
This form is computationally convenient — it's an unconstrained optimization problem.
The constrained form imposes an explicit bound on the regularizer:
$$\hat{w}_C = \arg\min_{w : \Omega(w) \leq C} \mathcal{L}(w; S)$$
where $C > 0$ is the capacity budget — the maximum allowed complexity.
Interpretation: Find the hypothesis that best fits the training data among all hypotheses with complexity at most $C$.
This form is conceptually clear — we explicitly allocate a "budget" of complexity.
The penalized form is often called Tikhonov regularization after A.N. Tikhonov who developed it for ill-posed problems. The constrained form is sometimes called Ivanov regularization. In machine learning, we typically use the penalized form because it's easier to optimize, but understanding the constrained form provides geometric intuition.
Theorem (Lagrangian Duality): Under mild conditions (convexity of $\mathcal{L}$ and $\Omega$, Slater's condition), the penalized and constrained forms produce the same solutions for corresponding values of $\lambda$ and $C$: every $\lambda \geq 0$ yields a solution $\hat{w}_\lambda$ that also solves the constrained problem with budget $C = \Omega(\hat{w}_\lambda)$, and conversely, for every $C$ at which the constraint is active there is a $\lambda \geq 0$ giving the same solution.
The relationship is monotone: as $\lambda$ increases, the corresponding budget $C$ decreases (stricter constraint); as $\lambda$ decreases, $C$ increases (looser constraint).
| Aspect | Penalized Form | Constrained Form |
|---|---|---|
| Formulation | $\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$ | $\min_w \mathcal{L}(w)$ s.t. $\Omega(w) \leq C$ |
| Free parameter | $\lambda$ (regularization strength) | $C$ (capacity budget) |
| Optimization type | Unconstrained | Constrained |
| Algorithms | Standard gradient descent | Projected gradient, barrier methods |
| Interpretation | Trade off fit vs. complexity | Best fit within complexity budget |
| Tuning | Cross-validate over $\lambda$ | Cross-validate over $C$ |
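To make the $\lambda \leftrightarrow C$ correspondence concrete, here is a minimal sketch (assuming a small synthetic regression problem and NumPy; all names are illustrative) that solves the penalized ridge problem in closed form for several values of $\lambda$ and reports the implied budget $C(\lambda) = \|\hat{w}_\lambda\|_2^2$. The implied budget shrinks monotonically as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Penalized (Tikhonov) form: closed-form ridge solution
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # Implied budget: the constrained form with C = ||w_hat||_2^2 has the same solution
    print(f"lambda = {lam:7.2f}  ->  implied budget C = {np.sum(w_hat**2):.4f}")
```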
The connection between penalized and constrained forms arises from Lagrangian duality, a cornerstone of optimization theory. Understanding this connection deepens our insight into regularization.
For the constrained problem: $$\min_w \mathcal{L}(w) \quad \text{subject to} \quad \Omega(w) \leq C$$
The Lagrangian is: $$L(w, \lambda) = \mathcal{L}(w) + \lambda(\Omega(w) - C)$$
where $\lambda \geq 0$ is the Lagrange multiplier (dual variable).
The Lagrangian converts the constrained problem into an unconstrained one: the constraint is absorbed into the objective, weighted by $\lambda$. Minimizing over $w$ and maximizing over $\lambda$ yields the dual problem:
The dual problem is: $$\max_{\lambda \geq 0} \left[ \min_w L(w, \lambda) \right] = \max_{\lambda \geq 0} \left[ \min_w \mathcal{L}(w) + \lambda(\Omega(w) - C) \right]$$
The inner minimization: $$g(\lambda) = \min_w \left[ \mathcal{L}(w) + \lambda \Omega(w) \right] - \lambda C$$
Note that $\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$ is exactly the penalized form!
Key insight: Solving the penalized form for various $\lambda$ is equivalent to solving the inner optimization of the dual problem. The dual then adjusts $\lambda$ to match the constraint budget $C$.
When the loss $\mathcal{L}$ and regularizer $\Omega$ are convex, and Slater's condition holds (there exists a strictly feasible point), strong duality guarantees that the primal and dual optimal values coincide. This is why the penalized and constrained forms are truly equivalent, not just related — they solve the same underlying problem.
The Karush-Kuhn-Tucker (KKT) conditions characterize the optimal solution:
1. Stationarity: $$\nabla_w \mathcal{L}(w^*) + \lambda^* \nabla \Omega(w^*) = 0$$
2. Primal feasibility: $$\Omega(w^*) \leq C$$
3. Dual feasibility: $$\lambda^* \geq 0$$
4. Complementary slackness: $$\lambda^* (\Omega(w^*) - C) = 0$$
The complementary slackness condition is particularly revealing: at the optimum, either $\lambda^* = 0$ (the constraint has no effect) or the constraint is tight, $\Omega(w^*) = C$.
When is the constraint active?
The unregularized ERM solution $\hat{w}_{\text{ERM}} = \arg\min_w \mathcal{L}(w)$ may or may not satisfy $\Omega(\hat{w}_{\text{ERM}}) \leq C$.
Case 1: $\Omega(\hat{w}_{\text{ERM}}) \leq C$ (constraint satisfied naturally). The constraint is inactive, $\lambda^* = 0$, and the regularized solution coincides with the ERM solution; regularization has no effect.
Case 2: $\Omega(\hat{w}_{\text{ERM}}) > C$ (constraint violated). The constraint is active, $\lambda^* > 0$, and the solution lies exactly on the boundary $\Omega(w) = C$.
The transition: as $C$ decreases from $\infty$ to $0$, the constraint is first slack and the solution stays at $\hat{w}_{\text{ERM}}$; once $C$ falls below $\Omega(\hat{w}_{\text{ERM}})$, the constraint binds and the solution moves along the boundary toward $w = 0$.
The constrained formulation provides a powerful geometric picture of regularization. In parameter space, the problem becomes finding where level sets of the loss intersect the constraint region.
Consider the 2D case with parameters $(w_1, w_2)$:
Loss contours: The sets $\{w : \mathcal{L}(w) = c\}$ form level curves, typically ellipses for quadratic loss (centered at the ERM solution).
Constraint region: The set $\{w : \Omega(w) \leq C\}$ defines the "budget region."
The regularized solution: The point where the smallest loss contour touches the constraint boundary.
For L2 regularization, $\Omega(w) = \|w\|_2^2 = w_1^2 + w_2^2 \leq C$
Constraint shape: Circle (2D), sphere (higher D)
Geometric solution: the loss contours expand outward from the ERM solution until one just touches the circle; the optimum is the point of tangency, where the loss gradient is parallel to $w$.
Effect on parameters: all weights are shrunk toward zero, but generically none becomes exactly zero.
Intuition: The L2 ball is "smooth" — it has no corners. The tangent point can be anywhere on the sphere surface.
For L1 regularization, $\Omega(w) = \|w\|_1 = |w_1| + |w_2| \leq C$
Constraint shape: Diamond/rhombus (2D), cross-polytope (higher D)
Geometric solution: the expanding loss contour typically first touches the diamond at one of its corners rather than along a face.
Effect on parameters: some coordinates are driven exactly to zero while the rest are shrunk — a sparse solution.
Intuition: The L1 diamond has sharp corners on the axes. Loss ellipses are more likely to first touch a corner than a face, driving some parameters to exactly zero.
Mathematical reason: at a corner of the L1 ball the norm is not differentiable, and its subdifferential there contains a whole range of directions. Moving off a corner along the constraint surface immediately makes some zero coordinate nonzero, which lowers the loss only if the corresponding gradient component is large enough — otherwise staying at the corner, with those coordinates exactly zero, is optimal.
The key geometric insight: L1 balls have corners on the axes, while L2 balls have smooth surfaces. When loss contours intersect these shapes, they're much more likely to hit L1 corners than L2 surfaces along axes. At a corner, one or more coordinates are exactly zero, producing sparse solutions. This is why L1 (Lasso) is used for feature selection while L2 (Ridge) just shrinks all weights uniformly.
| Regularizer | Constraint Shape | Corners | Sparsity-Inducing |
|---|---|---|---|
| L2 (Ridge) | Sphere | None (smooth) | No |
| L1 (Lasso) | Cross-polytope | On axes | Yes |
| L∞ | Hypercube | At vertices (all coordinates at the bound) | No (encourages equal magnitudes) |
| Elastic Net | Rounded diamond | Near axes | Yes (less than L1) |
| Group Lasso | L1-type ball over group norms | Where whole groups are zero | Yes (group sparsity) |
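As a quick empirical check of the sparsity column above, the following sketch (assuming scikit-learn and a synthetic dataset with only three informative features; the penalty levels are illustrative) fits Ridge and Lasso and counts nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]          # only 3 informative features
y = X @ w_true + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge nonzero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))  # typically all 20
print("Lasso nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))  # typically close to 3
```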
As we vary the regularization strength from $\lambda = 0$ to $\lambda = \infty$ (or equivalently, $C$ from $\infty$ to $0$), the optimal solution traces a path through parameter space. Understanding this regularization path provides insight into how regularization affects model complexity.
The regularization path is the set: $$\{\hat{w}(\lambda) : \lambda \in [0, \infty)\}$$
where $\hat{w}(\lambda) = \arg\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$
Key properties: for convex problems such as Ridge and Lasso the path varies continuously with $\lambda$; it starts at the ERM solution at $\lambda = 0$ and shrinks toward the minimizer of $\Omega$ alone (typically $w = 0$) as $\lambda \to \infty$; and the complexity $\Omega(\hat{w}(\lambda))$ is non-increasing in $\lambda$.
For linear regression with L2 regularization: $$\hat{w}(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y$$
Path properties: every coefficient shrinks smoothly toward zero as $\lambda$ grows, but none reaches exactly zero for finite $\lambda$; in the SVD basis, the component along the $i$-th singular direction is scaled by $d_i^2 / (d_i^2 + \lambda)$, so low-variance directions are shrunk the most.
Computational note: To compute the path, we can solve for multiple $\lambda$ values efficiently using a single SVD of $X$.
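A minimal sketch of this idea, assuming NumPy and a synthetic problem (the function name is ours, not a library API): one SVD of $X$ is reused to produce ridge solutions for many values of $\lambda$.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge solutions for many lambdas from a single SVD of X (a sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # For each lambda: w(lambda) = V diag(s / (s^2 + lambda)) U^T y
    return np.stack([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
path = ridge_path(X, y, lambdas=[0.01, 0.1, 1.0, 10.0, 100.0])
print(path.round(3))   # each row shrinks smoothly toward zero as lambda grows
```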
For linear regression with L1 regularization: $$\hat{w}(\lambda) = \arg\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1$$
Path properties: the path is piecewise linear in $\lambda$; coefficients enter and leave the active set at a finite number of breakpoints; and for $\lambda \geq \|X^\top y\|_\infty$ all coefficients are exactly zero.
The LARS algorithm (Least Angle Regression) exploits this piecewise-linear structure: it adds one predictor at a time and moves the coefficients along the direction equiangular to the active predictors until the next breakpoint.
The ability to compute entire regularization paths efficiently is practically valuable. For the Lasso, the LARS algorithm (Efron et al., 2004) computes the entire piecewise-linear path in a single pass. For Ridge, the closed-form solution enables similar efficiency. This is exploited in cross-validation: compute the path once, then evaluate at many $\lambda$ values.
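For example, scikit-learn exposes this through `lars_path`; the sketch below (synthetic data, illustrative settings) recovers the breakpoints of the piecewise-linear Lasso path and the order in which features enter the model.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.normal(size=(n, d))
w_true = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=n)

# LARS with the lasso modification returns the breakpoints (alphas) and
# the piecewise-linear coefficient path between them.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("breakpoints:", alphas.round(3))
print("order in which features enter the model:", active)
```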
The regularization path directly supports model selection:
1. Computational efficiency: warm-starting from neighboring $\lambda$ values (or computing the whole path at once) is far cheaper than solving each problem independently from scratch.
2. Understanding variable importance: the order in which coefficients enter the Lasso path gives a rough ranking of feature relevance.
3. Stability analysis: features whose coefficients stay nonzero over a wide range of $\lambda$ are more trustworthy than those that appear only briefly.
4. Degrees of freedom: for the Lasso, the number of nonzero coefficients at a given $\lambda$ is an unbiased estimate of the effective degrees of freedom, which plugs directly into criteria such as AIC and BIC.
Another powerful geometric interpretation views regularization as a projection operation, connecting it to the theory of projections in Hilbert spaces.
Consider the constrained form: $$\hat{w}_C = \arg\min_{w : \Omega(w) \leq C} \mathcal{L}(w)$$
When the loss has spherical level sets around $\hat{w}_{\text{ERM}}$ (e.g., an isotropic quadratic), this can be solved exactly by orthogonal projection; more generally the projection picture is a useful approximation and source of intuition:
Step 1: Compute unconstrained optimum $\hat{w}_{\text{ERM}} = \arg\min_w \mathcal{L}(w)$
Step 2: If $\Omega(\hat{w}_{\text{ERM}}) \leq C$, done. Otherwise:
Step 3: Project onto the constraint set: $$\hat{w}_C = \text{Proj}_{\Omega \leq C}(\hat{w}_{\text{ERM}})$$
For L2 regularization, this projection has a closed form.
For the L2 constraint $\|w\|_2 \leq r$ (with $r = \sqrt{C}$):
$$\text{Proj}_{\|\cdot\|_2 \leq r}(w) = \begin{cases} w & \text{if } \|w\|_2 \leq r \\ r \cdot \dfrac{w}{\|w\|_2} & \text{if } \|w\|_2 > r \end{cases}$$
Interpretation: if the vector already lies inside the ball, leave it alone; otherwise rescale it to have norm exactly $r$, preserving its direction.
Effect: All parameters shrink by the same factor. Relative magnitudes are preserved.
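A minimal NumPy sketch of this projection (the function name is ours, not a library API):

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}: rescale if outside the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

w = np.array([3.0, 4.0])            # ||w||_2 = 5
print(project_l2_ball(w, 2.0))      # [1.2, 1.6] -- same direction, norm exactly 2
```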
For the L1 constraint $\|w\|_1 \leq C$, the projection involves the soft-thresholding operator:
$$S_\tau(w_j) = \text{sign}(w_j) \cdot \max(|w_j| - \tau, 0)$$
where $\tau$ is the threshold level (determined by $C$).
Interpretation: each coordinate is moved toward zero by the amount $\tau$; coordinates whose magnitude is below $\tau$ are set exactly to zero.
Effect: Small parameters disappear entirely; large parameters shrink by a constant amount. This is the sparsity-inducing property of L1.
Hard thresholding (used in best-subset selection) sets small parameters to zero and leaves large parameters unchanged: $H_\tau(w) = w \cdot \mathbb{1}[|w| > \tau]$. Soft thresholding (from L1) both zeros small parameters and shrinks large ones. Soft thresholding is continuous and convex; hard thresholding is discontinuous and leads to NP-hard optimization.
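The two operators side by side, as a small NumPy sketch (function names are illustrative):

```python
import numpy as np

def soft_threshold(w, tau):
    """Soft thresholding: shrink every entry toward zero by tau, zeroing anything smaller."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def hard_threshold(w, tau):
    """Hard thresholding: zero small entries, leave large ones unchanged."""
    return w * (np.abs(w) > tau)

w = np.array([0.3, -0.8, 2.5, -0.05])
print(soft_threshold(w, 0.5))   # [ 0.  -0.3  2.  -0. ]
print(hard_threshold(w, 0.5))   # [ 0.  -0.8  2.5 -0. ]
```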
The projection interpretation suggests a natural optimization algorithm:
Projected Gradient Descent:
Initialize w^0
For t = 0, 1, 2, ...:
w^{t+1/2} = w^t - η ∇L(w^t) # Gradient step
w^{t+1} = Proj_{Ω ≤ C}(w^{t+1/2}) # Project back
Properties: every iterate is feasible; for convex $\mathcal{L}$ and a step size $\eta \leq 1/L$ (with $L$ the Lipschitz constant of $\nabla \mathcal{L}$) the method converges to the constrained optimum; and each iteration costs one gradient plus one projection, which is cheap for L2 and L1 balls. A runnable sketch appears after the proximal note below.
Proximal perspective: The projection can be viewed as solving: $$\text{Proj}_{\Omega \leq C}(z) = \arg\min_{w : \Omega(w) \leq C} \|w - z\|_2^2$$
This connects to proximal gradient methods, a powerful framework for optimization with non-smooth regularizers.
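Here is a runnable sketch of projected gradient descent for least squares with an L2-ball constraint, assuming NumPy; the step size is set from the Lipschitz constant of the gradient, and all names are illustrative.

```python
import numpy as np

def projected_gradient_descent(X, y, radius, eta=None, num_iters=500):
    """Minimize 0.5 * ||y - Xw||^2 subject to ||w||_2 <= radius (a sketch)."""
    n, d = X.shape
    if eta is None:
        # Step size 1/L, with L = sigma_max(X)^2 the Lipschitz constant of the gradient
        eta = 1.0 / np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)          # gradient of the least-squares loss
        w = w - eta * grad                # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                 # project back onto the L2 ball
            w = (radius / norm) * w
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
w_c = projected_gradient_descent(X, y, radius=1.0)
print(w_c, np.linalg.norm(w_c))   # the solution lies on (or inside) the budget boundary
```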
Understanding the constraint view suggests principles for designing effective regularizers based on the desired properties of solutions.
The geometry of the constraint set $\{w : \Omega(w) \leq C\}$ directly determines solution characteristics:
| Constraint Shape | Solution Property | Example |
|---|---|---|
| Spherical (L2) | Uniform shrinkage | Ridge regression |
| Diamond (L1) | Coordinate sparsity | Lasso |
| Cylindrical | Sparsity in some coordinates | Partial penalization |
| Polyhedral | Sparse, structured | Group Lasso variants |
| Non-convex | Many local minima | SCAD, MCP |
Convex constraints lead to: a convex feasible region, so any local minimum of a convex loss is global; efficient, well-understood algorithms; and solutions that depend stably on the data and on $C$.
Non-convex constraints may provide: sparser solutions and less bias on large coefficients, but at the cost of multiple local minima and harder, initialization-dependent optimization.
Practical recommendation: Start with convex regularizers (L1, L2, Elastic Net). Consider non-convex only if convex solutions are unsatisfactory and you have robust optimization procedures.
The constraint set should encode domain knowledge about the problem:
Sparsity expected: Use L1 or similar (many features irrelevant)
Smoothness expected: Use L2 or Sobolev norms (adjacent features related)
Group structure known: Use Group Lasso (features come in groups)
Monotonicity expected: Use isotonic constraints (response should increase in certain features)
Low-rank expected: Use nuclear norm (for matrix-valued parameters)
The constraint set is a form of inductive bias — it encodes which solutions are a priori plausible. Choosing the right constraint is as important as choosing the right model architecture. Domain expertise about the problem should guide this choice.
The constraint budget $C$ (or equivalently, $\lambda$) must be calibrated to the problem:
Too large $C$ (too small $\lambda$): the constraint is barely active, the model can overfit, and variance dominates the error.
Too small $C$ (too large $\lambda$): the model is over-constrained and underfits; bias dominates.
Optimal $C$: balances bias and variance; in practice it is chosen by validation performance rather than set a priori.
The constraint perspective on regularization has important practical implications for implementation and tuning.
When to use penalized form ($\lambda$): when standard unconstrained solvers (gradient descent, closed-form solutions, off-the-shelf libraries) are available, and when regularization strength is tuned by cross-validation.
When to use constrained form ($C$): when the budget has a direct interpretation (an explicit norm bound, a hard resource or safety requirement), or when projection onto the constraint set is cheap, making projected gradient methods attractive.
Cross-validation: the most common approach; fit the model over a grid of $\lambda$ (or $C$) values and select the one with the best held-out performance.
Information criteria: AIC, BIC $$\text{AIC} = 2k - 2\ln(\hat{L})$$ $$\text{BIC} = k\ln(n) - 2\ln(\hat{L})$$ where $k$ is the number of effective parameters, $n$ the sample size, and $\hat{L}$ the maximized likelihood.
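As a sketch of how these criteria are computed for a Gaussian linear model (assuming the maximized likelihood is evaluated with $\hat{\sigma}^2 = \text{RSS}/n$; the helper name is ours):

```python
import numpy as np

def aic_bic_linear(X, y, w, k):
    """AIC/BIC for a Gaussian linear model with k effective parameters (a sketch)."""
    n = len(y)
    rss = np.sum((y - X @ w) ** 2)
    # Maximized Gaussian log-likelihood with sigma^2 estimated as RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + 0.2 * rng.normal(size=100)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(aic_bic_linear(X, y, w_hat, k=3))   # (AIC, BIC)
```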
Bayesian approaches: place a prior on the parameters (and possibly on $\lambda$ itself) and choose $\lambda$ by maximizing the marginal likelihood (evidence) or by full posterior inference; this connection is developed on the next page.
Regularization penalizes parameter magnitudes, so feature scaling critically affects results. If feature x₁ ranges [0,1] and x₂ ranges [0,1000], their coefficients have different scales. Standard practice: standardize to zero mean, unit variance before regularization. The intercept is typically not penalized.
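A common pattern, sketched below with scikit-learn (the dataset and penalty level are illustrative): standardize inside a pipeline so the scaling is learned on the training data only, with the intercept fit separately and left unpenalized.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Standardize features before penalizing so the penalty treats them comparably;
# Ridge fits the intercept separately and does not penalize it.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_.round(2))
```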
We have developed a comprehensive understanding of regularization from the optimization perspective: penalized and constrained formulations are dual views of the same problem, the geometry of the constraint set determines the character of the solution, and the regularization path describes how solutions evolve as the budget changes.
The constraint view provides optimization-theoretic insights. In the next page, we explore a complementary perspective: Regularization as Prior. This Bayesian view interprets the regularizer as encoding prior beliefs about parameter distributions, connecting regularization to fundamental principles of probabilistic inference.
You now understand regularization from the optimization perspective: the equivalence of penalized and constrained forms, the geometry of constraint sets, the regularization path, and practical implementation considerations. This foundation complements the Bayesian view of the next page.