When we regularize a learning problem, we are not merely adding a penalty term to our objective function—we are constraining the space of solutions to those with bounded complexity. This constrained optimization perspective reveals the geometric structure underlying regularization and provides powerful insights into why different regularizers have different effects.
In this page, we develop the optimization-theoretic view of regularization, showing how penalty formulations and constraint formulations are two sides of the same coin, connected through the beautiful mathematics of Lagrangian duality.
By the end of this page, you will understand: (1) the equivalence between penalized and constrained optimization formulations, (2) how Lagrangian duality connects regularization strength to constraint budgets, (3) the geometric interpretation of regularization in parameter space, (4) how constraint geometry affects solution properties, and (5) practical implications for algorithm design and hyperparameter tuning.
Regularization can be expressed in two mathematically related forms: the penalized (Lagrangian) form and the constrained (Ivanov) form.
The penalized form adds a regularization term to the loss:
$$\hat{w}_{\lambda} = \arg\min_{w} \left[ \mathcal{L}(w; S) + \lambda \Omega(w) \right]$$
where $\mathcal{L}(w; S)$ is the empirical loss on the training sample $S$, $\Omega(w)$ is the regularizer measuring model complexity, and $\lambda \geq 0$ controls the strength of regularization.
This form is computationally convenient — it's an unconstrained optimization problem.
The constrained form imposes an explicit bound on the regularizer:
$$\hat{w}_C = \arg\min_{w : \Omega(w) \leq C} \mathcal{L}(w; S)$$
where $C > 0$ is the capacity budget — the maximum allowed complexity.
Interpretation: Find the hypothesis that best fits the training data among all hypotheses with complexity at most $C$.
This form is conceptually clear — we explicitly allocate a "budget" of complexity.
The penalized form is often called Tikhonov regularization after A.N. Tikhonov who developed it for ill-posed problems. The constrained form is sometimes called Ivanov regularization. In machine learning, we typically use the penalized form because it's easier to optimize, but understanding the constrained form provides geometric intuition.
Theorem (Lagrangian Duality): Under mild conditions (convexity of $\mathcal{L}$ and $\Omega$, Slater's condition), the penalized and constrained forms produce the same solutions for corresponding values of $\lambda$ and $C$: every $\lambda \geq 0$ yields a solution $\hat{w}_\lambda$ that also solves the constrained problem with budget $C = \Omega(\hat{w}_\lambda)$, and conversely, for every $C$ at which the constraint is active there is a $\lambda \geq 0$ giving the same solution.
The relationship is monotone: as $\lambda$ increases, the corresponding budget $C$ decreases (stricter constraint); as $\lambda$ decreases, $C$ increases (looser constraint).
| Aspect | Penalized Form | Constrained Form |
|---|---|---|
| Formulation | $\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$ | $\min_w \mathcal{L}(w)$ s.t. $\Omega(w) \leq C$ |
| Free parameter | $\lambda$ (regularization strength) | $C$ (capacity budget) |
| Optimization type | Unconstrained | Constrained |
| Algorithms | Standard gradient descent | Projected gradient, barrier methods |
| Interpretation | Trade off fit vs. complexity | Best fit within complexity budget |
| Tuning | Cross-validate over $\lambda$ | Cross-validate over $C$ |
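To make the $\lambda \leftrightarrow C$ correspondence concrete, here is a minimal sketch (assuming a small synthetic regression problem and NumPy; all names are illustrative) that solves the penalized ridge problem in closed form for several values of $\lambda$ and reports the implied budget $C(\lambda) = \|\hat{w}_\lambda\|_2^2$. The implied budget shrinks monotonically as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Penalized (Tikhonov) form: closed-form ridge solution
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # Implied budget: the constrained form with C = ||w_hat||_2^2 has the same solution
    print(f"lambda = {lam:7.2f}  ->  implied budget C = {np.sum(w_hat**2):.4f}")
```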
The connection between penalized and constrained forms arises from Lagrangian duality, a cornerstone of optimization theory. Understanding this connection deepens our insight into regularization.
For the constrained problem: $$\min_w \mathcal{L}(w) \quad \text{subject to} \quad \Omega(w) \leq C$$
The Lagrangian is: $$L(w, \lambda) = \mathcal{L}(w) + \lambda(\Omega(w) - C)$$
where $\lambda \geq 0$ is the Lagrange multiplier (dual variable).
The Lagrangian converts the constrained problem into an unconstrained one: the constraint is absorbed into the objective, weighted by $\lambda$. Minimizing over $w$ and maximizing over $\lambda$ yields the dual problem:
The dual problem is: $$\max_{\lambda \geq 0} \left[ \min_w L(w, \lambda) \right] = \max_{\lambda \geq 0} \left[ \min_w \mathcal{L}(w) + \lambda(\Omega(w) - C) \right]$$
The inner minimization: $$g(\lambda) = \min_w \left[ \mathcal{L}(w) + \lambda \Omega(w) \right] - \lambda C$$
Note that $\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$ is exactly the penalized form!
Key insight: Solving the penalized form for various $\lambda$ is equivalent to solving the inner optimization of the dual problem. The dual then adjusts $\lambda$ to match the constraint budget $C$.
When the loss $\mathcal{L}$ and regularizer $\Omega$ are convex, and Slater's condition holds (there exists a strictly feasible point), strong duality guarantees that the primal and dual optimal values coincide. This is why the penalized and constrained forms are truly equivalent, not just related — they solve the same underlying problem.
The Karush-Kuhn-Tucker (KKT) conditions characterize the optimal solution:
1. Stationarity: $$\nabla_w \mathcal{L}(w^*) + \lambda^* \nabla \Omega(w^*) = 0$$
2. Primal feasibility: $$\Omega(w^*) \leq C$$
3. Dual feasibility: $$\lambda^* \geq 0$$
4. Complementary slackness: $$\lambda^* (\Omega(w^*) - C) = 0$$
The complementary slackness condition is particularly revealing: at the optimum, either $\lambda^* = 0$ (the constraint has no effect) or the constraint is tight, $\Omega(w^*) = C$.
When is the constraint active?
The unregularized ERM solution $\hat{w}_{\text{ERM}} = \arg\min_w \mathcal{L}(w)$ may or may not satisfy $\Omega(\hat{w}_{\text{ERM}}) \leq C$.
Case 1: $\Omega(\hat{w}_{\text{ERM}}) \leq C$ (constraint satisfied naturally). The constraint is inactive, $\lambda^* = 0$, and the regularized solution coincides with the ERM solution; regularization has no effect.
Case 2: $\Omega(\hat{w}_{\text{ERM}}) > C$ (constraint violated). The constraint is active, $\lambda^* > 0$, and the solution lies exactly on the boundary $\Omega(w) = C$.
The transition: as $C$ decreases from $\infty$ to $0$, the constraint is first slack and the solution stays at $\hat{w}_{\text{ERM}}$; once $C$ falls below $\Omega(\hat{w}_{\text{ERM}})$, the constraint binds and the solution moves along the boundary toward $w = 0$.
The constrained formulation provides a powerful geometric picture of regularization. In parameter space, the problem becomes finding where level sets of the loss intersect the constraint region.
Consider the 2D case with parameters $(w_1, w_2)$:
Loss contours: The sets $\{w : \mathcal{L}(w) = c\}$ form level curves, typically ellipses for quadratic loss (centered at the ERM solution).
Constraint region: The set $\{w : \Omega(w) \leq C\}$ defines the "budget region."
The regularized solution: The point where the smallest loss contour touches the constraint boundary.
For L2 regularization, $\Omega(w) = \|w\|_2^2 = w_1^2 + w_2^2 \leq C$
Constraint shape: Circle (2D), sphere (higher D)
Geometric solution: the loss contours expand outward from the ERM solution until one just touches the circle; the optimum is the point of tangency, where the loss gradient is parallel to $w$.
Effect on parameters: all weights are shrunk toward zero, but generically none becomes exactly zero.
Intuition: The L2 ball is "smooth" — it has no corners. The tangent point can be anywhere on the sphere surface.
For L1 regularization, $\Omega(w) = \|w\|_1 = |w_1| + |w_2| \leq C$
Constraint shape: Diamond/rhombus (2D), cross-polytope (higher D)
Geometric solution: the expanding loss contour typically first touches the diamond at one of its corners rather than along a face.
Effect on parameters: some coordinates are driven exactly to zero while the rest are shrunk — a sparse solution.
Intuition: The L1 diamond has sharp corners on the axes. Loss ellipses are more likely to first touch a corner than a face, driving some parameters to exactly zero.
Mathematical reason: at a corner of the L1 ball the norm is not differentiable, and its subdifferential there contains a whole range of directions. Moving off a corner along the constraint surface immediately makes some zero coordinate nonzero, which lowers the loss only if the corresponding gradient component is large enough — otherwise staying at the corner, with those coordinates exactly zero, is optimal.
The key geometric insight: L1 balls have corners on the axes, while L2 balls have smooth surfaces. When loss contours intersect these shapes, they're much more likely to hit L1 corners than L2 surfaces along axes. At a corner, one or more coordinates are exactly zero, producing sparse solutions. This is why L1 (Lasso) is used for feature selection while L2 (Ridge) just shrinks all weights uniformly.
| Regularizer | Constraint Shape | Corners | Sparsity-Inducing |
|---|---|---|---|
| L2 (Ridge) | Sphere | None (smooth) | No |
| L1 (Lasso) | Cross-polytope | On axes | Yes |
| L∞ | Hypercube | At vertices (all coordinates at the bound) | No (encourages equal magnitudes) |
| Elastic Net | Rounded diamond | Near axes | Yes (less than L1) |
| Group Lasso | L1-type ball over group norms | Where whole groups are zero | Yes (group sparsity) |
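As a quick empirical check of the sparsity column above, the following sketch (assuming scikit-learn and a synthetic dataset with only three informative features; the penalty levels are illustrative) fits Ridge and Lasso and counts nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]          # only 3 informative features
y = X @ w_true + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge nonzero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))  # typically all 20
print("Lasso nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))  # typically close to 3
```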
As we vary the regularization strength from $\lambda = 0$ to $\lambda = \infty$ (or equivalently, $C$ from $\infty$ to $0$), the optimal solution traces a path through parameter space. Understanding this regularization path provides insight into how regularization affects model complexity.
The regularization path is the set: $$\{\hat{w}(\lambda) : \lambda \in [0, \infty)\}$$
where $\hat{w}(\lambda) = \arg\min_w [\mathcal{L}(w) + \lambda \Omega(w)]$
Key properties: for convex problems such as Ridge and Lasso the path varies continuously with $\lambda$; it starts at the ERM solution at $\lambda = 0$ and shrinks toward the minimizer of $\Omega$ alone (typically $w = 0$) as $\lambda \to \infty$; and the complexity $\Omega(\hat{w}(\lambda))$ is non-increasing in $\lambda$.
For linear regression with L2 regularization: $$\hat{w}(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y$$
Path properties: every coefficient shrinks smoothly toward zero as $\lambda$ grows, but none reaches exactly zero for finite $\lambda$; in the SVD basis, the component along the $i$-th singular direction is scaled by $d_i^2 / (d_i^2 + \lambda)$, so low-variance directions are shrunk the most.
Computational note: To compute the path, we can solve for multiple $\lambda$ values efficiently using a single SVD of $X$.
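A minimal sketch of this idea, assuming NumPy and a synthetic problem (the function name is ours, not a library API): one SVD of $X$ is reused to produce ridge solutions for many values of $\lambda$.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge solutions for many lambdas from a single SVD of X (a sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    # For each lambda: w(lambda) = V diag(s / (s^2 + lambda)) U^T y
    return np.stack([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
path = ridge_path(X, y, lambdas=[0.01, 0.1, 1.0, 10.0, 100.0])
print(path.round(3))   # each row shrinks smoothly toward zero as lambda grows
```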
For linear regression with L1 regularization: $$\hat{w}(\lambda) = \arg\min_w \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1$$
Path properties: the path is piecewise linear in $\lambda$; coefficients enter and leave the active set at a finite number of breakpoints; and for $\lambda \geq \|X^\top y\|_\infty$ all coefficients are exactly zero.
The LARS algorithm (Least Angle Regression) exploits this piecewise-linear structure: it adds one predictor at a time and moves the coefficients along the direction equiangular to the active predictors until the next breakpoint.
The ability to compute entire regularization paths efficiently is practically valuable. For the Lasso, the LARS algorithm (Efron et al., 2004) computes the entire piecewise-linear path in a single pass. For Ridge, the closed-form solution enables similar efficiency. This is exploited in cross-validation: compute the path once, then evaluate at many $\lambda$ values.
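For example, scikit-learn exposes this through `lars_path`; the sketch below (synthetic data, illustrative settings) recovers the breakpoints of the piecewise-linear Lasso path and the order in which features enter the model.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.normal(size=(n, d))
w_true = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=n)

# LARS with the lasso modification returns the breakpoints (alphas) and
# the piecewise-linear coefficient path between them.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("breakpoints:", alphas.round(3))
print("order in which features enter the model:", active)
```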
The regularization path directly supports model selection:
1. Computational efficiency: warm-starting from neighboring $\lambda$ values (or computing the whole path at once) is far cheaper than solving each problem independently from scratch.
2. Understanding variable importance: the order in which coefficients enter the Lasso path gives a rough ranking of feature relevance.
3. Stability analysis: features whose coefficients stay nonzero over a wide range of $\lambda$ are more trustworthy than those that appear only briefly.
4. Degrees of freedom: for the Lasso, the number of nonzero coefficients at a given $\lambda$ is an unbiased estimate of the effective degrees of freedom, which plugs directly into criteria such as AIC and BIC.
Another powerful geometric interpretation views regularization as a projection operation, connecting it to the theory of projections in Hilbert spaces.
Consider the constrained form: $$\hat{w}_C = \arg\min_{w : \Omega(w) \leq C} \mathcal{L}(w)$$
When the loss has spherical level sets around $\hat{w}_{\text{ERM}}$ (e.g., an isotropic quadratic), this can be solved exactly by orthogonal projection; more generally the projection picture is a useful approximation and source of intuition:
Step 1: Compute unconstrained optimum $\hat{w}_{\text{ERM}} = \arg\min_w \mathcal{L}(w)$
Step 2: If $\Omega(\hat{w}_{\text{ERM}}) \leq C$, done. Otherwise:
Step 3: Project onto the constraint set: $$\hat{w}_C = \text{Proj}_{\Omega \leq C}(\hat{w}_{\text{ERM}})$$
For L2 regularization, this projection has a closed form.
For the L2 constraint $\|w\|_2 \leq r$ (with $r = \sqrt{C}$):
$$\text{Proj}_{\|\cdot\|_2 \leq r}(w) = \begin{cases} w & \text{if } \|w\|_2 \leq r \\ r \cdot \dfrac{w}{\|w\|_2} & \text{if } \|w\|_2 > r \end{cases}$$
Interpretation: if the vector already lies inside the ball, leave it alone; otherwise rescale it to have norm exactly $r$, preserving its direction.
Effect: All parameters shrink by the same factor. Relative magnitudes are preserved.
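A minimal NumPy sketch of this projection (the function name is ours, not a library API):

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}: rescale if outside the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

w = np.array([3.0, 4.0])            # ||w||_2 = 5
print(project_l2_ball(w, 2.0))      # [1.2, 1.6] -- same direction, norm exactly 2
```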
For the L1 constraint $\|w\|_1 \leq C$, the projection involves the soft-thresholding operator:
$$S_\tau(w_j) = \text{sign}(w_j) \cdot \max(|w_j| - \tau, 0)$$
where $\tau$ is the threshold level (determined by $C$).
Interpretation: each coordinate is moved toward zero by the amount $\tau$; coordinates whose magnitude is below $\tau$ are set exactly to zero.
Effect: Small parameters disappear entirely; large parameters shrink by a constant amount. This is the sparsity-inducing property of L1.
Hard thresholding (used in best-subset selection) sets small parameters to zero and leaves large parameters unchanged: $H_\tau(w) = w \cdot \mathbb{1}[|w| > \tau]$. Soft thresholding (from L1) both zeros small parameters and shrinks large ones. Soft thresholding is continuous and convex; hard thresholding is discontinuous and leads to NP-hard optimization.
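The two operators side by side, as a small NumPy sketch (function names are illustrative):

```python
import numpy as np

def soft_threshold(w, tau):
    """Soft thresholding: shrink every entry toward zero by tau, zeroing anything smaller."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def hard_threshold(w, tau):
    """Hard thresholding: zero small entries, leave large ones unchanged."""
    return w * (np.abs(w) > tau)

w = np.array([0.3, -0.8, 2.5, -0.05])
print(soft_threshold(w, 0.5))   # [ 0.  -0.3  2.  -0. ]
print(hard_threshold(w, 0.5))   # [ 0.  -0.8  2.5 -0. ]
```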
The projection interpretation suggests a natural optimization algorithm:
Projected Gradient Descent:
Initialize w^0
For t = 0, 1, 2, ...:
w^{t+1/2} = w^t - η ∇L(w^t) # Gradient step
w^{t+1} = Proj_{Ω ≤ C}(w^{t+1/2}) # Project back
Properties: every iterate is feasible; for convex $\mathcal{L}$ and a step size $\eta \leq 1/L$ (with $L$ the Lipschitz constant of $\nabla \mathcal{L}$) the method converges to the constrained optimum; and each iteration costs one gradient plus one projection, which is cheap for L2 and L1 balls. A runnable sketch appears after the proximal note below.
Proximal perspective: The projection can be viewed as solving: $$\text{Proj}_{\Omega \leq C}(z) = \arg\min_{w : \Omega(w) \leq C} \|w - z\|_2^2$$
This connects to proximal gradient methods, a powerful framework for optimization with non-smooth regularizers.
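Here is a runnable sketch of projected gradient descent for least squares with an L2-ball constraint, assuming NumPy; the step size is set from the Lipschitz constant of the gradient, and all names are illustrative.

```python
import numpy as np

def projected_gradient_descent(X, y, radius, eta=None, num_iters=500):
    """Minimize 0.5 * ||y - Xw||^2 subject to ||w||_2 <= radius (a sketch)."""
    n, d = X.shape
    if eta is None:
        # Step size 1/L, with L = sigma_max(X)^2 the Lipschitz constant of the gradient
        eta = 1.0 / np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)          # gradient of the least-squares loss
        w = w - eta * grad                # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                 # project back onto the L2 ball
            w = (radius / norm) * w
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
w_c = projected_gradient_descent(X, y, radius=1.0)
print(w_c, np.linalg.norm(w_c))   # the solution lies on (or inside) the budget boundary
```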
Understanding the constraint view suggests principles for designing effective regularizers based on the desired properties of solutions.
The geometry of the constraint set $\{w : \Omega(w) \leq C\}$ directly determines solution characteristics:
| Constraint Shape | Solution Property | Example |
|---|---|---|
| Spherical (L2) | Uniform shrinkage | Ridge regression |
| Diamond (L1) | Coordinate sparsity | Lasso |
| Cylindrical | Sparsity in some coordinates | Partial penalization |
| Polyhedral | Sparse, structured | Group Lasso variants |
| Non-convex | Many local minima | SCAD, MCP |
Convex constraints lead to: a convex feasible region, so any local minimum of a convex loss is global; efficient, well-understood algorithms; and solutions that depend stably on the data and on $C$.
Non-convex constraints may provide: sparser solutions and less bias on large coefficients, but at the cost of multiple local minima and harder, initialization-dependent optimization.
Practical recommendation: Start with convex regularizers (L1, L2, Elastic Net). Consider non-convex only if convex solutions are unsatisfactory and you have robust optimization procedures.
The constraint set should encode domain knowledge about the problem:
Sparsity expected: Use L1 or similar (many features irrelevant)
Smoothness expected: Use L2 or Sobolev norms (adjacent features related)
Group structure known: Use Group Lasso (features come in groups)
Monotonicity expected: Use isotonic constraints (response should increase in certain features)
Low-rank expected: Use nuclear norm (for matrix-valued parameters)
The constraint set is a form of inductive bias — it encodes which solutions are a priori plausible. Choosing the right constraint is as important as choosing the right model architecture. Domain expertise about the problem should guide this choice.
The constraint budget $C$ (or equivalently, $\lambda$) must be calibrated to the problem:
Too large $C$ (too small $\lambda$): the constraint is barely active, the model can overfit, and variance dominates the error.
Too small $C$ (too large $\lambda$): the model is over-constrained and underfits; bias dominates.
Optimal $C$: balances bias and variance; in practice it is chosen by validation performance rather than set a priori.
The constraint perspective on regularization has important practical implications for implementation and tuning.
When to use penalized form ($\lambda$): when standard unconstrained solvers (gradient descent, closed-form solutions, off-the-shelf libraries) are available, and when regularization strength is tuned by cross-validation.
When to use constrained form ($C$): when the budget has a direct interpretation (an explicit norm bound, a hard resource or safety requirement), or when projection onto the constraint set is cheap, making projected gradient methods attractive.
Cross-validation: the most common approach; fit the model over a grid of $\lambda$ (or $C$) values and select the one with the best held-out performance.
Information criteria: AIC, BIC $$\text{AIC} = 2k - 2\ln(\hat{L})$$ $$\text{BIC} = k\ln(n) - 2\ln(\hat{L})$$ where $k$ is the number of effective parameters, $n$ the sample size, and $\hat{L}$ the maximized likelihood.
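As a sketch of how these criteria are computed for a Gaussian linear model (assuming the maximized likelihood is evaluated with $\hat{\sigma}^2 = \text{RSS}/n$; the helper name is ours):

```python
import numpy as np

def aic_bic_linear(X, y, w, k):
    """AIC/BIC for a Gaussian linear model with k effective parameters (a sketch)."""
    n = len(y)
    rss = np.sum((y - X @ w) ** 2)
    # Maximized Gaussian log-likelihood with sigma^2 estimated as RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + 0.2 * rng.normal(size=100)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(aic_bic_linear(X, y, w_hat, k=3))   # (AIC, BIC)
```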
Bayesian approaches: place a prior on the parameters (and possibly on $\lambda$ itself) and choose $\lambda$ by maximizing the marginal likelihood (evidence) or by full posterior inference; this connection is developed on the next page.
Regularization penalizes parameter magnitudes, so feature scaling critically affects results. If feature x₁ ranges [0,1] and x₂ ranges [0,1000], their coefficients have different scales. Standard practice: standardize to zero mean, unit variance before regularization. The intercept is typically not penalized.
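A common pattern, sketched below with scikit-learn (the dataset and penalty level are illustrative): standardize inside a pipeline so the scaling is learned on the training data only, with the intercept fit separately and left unpenalized.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Standardize features before penalizing so the penalty treats them comparably;
# Ridge fits the intercept separately and does not penalize it.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_.round(2))
```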
We have developed a comprehensive understanding of regularization from the optimization perspective: penalized and constrained formulations are dual views of the same problem, the geometry of the constraint set determines the character of the solution, and the regularization path describes how solutions evolve as the budget changes.
The constraint view provides optimization-theoretic insights. In the next page, we explore a complementary perspective: Regularization as Prior. This Bayesian view interprets the regularizer as encoding prior beliefs about parameter distributions, connecting regularization to fundamental principles of probabilistic inference.
You now understand regularization from the optimization perspective: the equivalence of penalized and constrained forms, the geometry of constraint sets, the regularization path, and practical implementation considerations. This foundation complements the Bayesian view of the next page.