Ridge Regression penalizes the squared L2 norm: ||β||₂² = Σⱼβⱼ². Lasso penalizes the L1 norm: ||β||₁ = Σⱼ|βⱼ|. But these are just two points on a continuous spectrum of Lq norm penalties:
$$\|\boldsymbol{\beta}\|_q = \left(\sum_{j=1}^{p} |\beta_j|^q\right)^{1/q}$$
Bridge Regression generalizes regularized regression to arbitrary q > 0:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$$
This seemingly simple generalization unlocks profound geometric insights and reveals why q = 1 (Lasso) is special for sparsity, while also opening the door to non-convex penalties (0 < q < 1) with even stronger sparsity-inducing properties.
By the end of this page, you will understand how the Lq norm parameter controls the geometry of the constraint region, why q = 1 is the convexity boundary, the theoretical properties of different q values, and when sub-L1 penalties (0 < q < 1) offer advantages despite their non-convexity.
For q > 0, the Lq (quasi-)norm of β ∈ ℝᵖ is:
$$\|\boldsymbol{\beta}\|_q = \left(\sum_{j=1}^{p} |\beta_j|^q\right)^{1/q}$$
Important special cases include:
| q Value | Name | Penalty Form | Key Property |
|---|---|---|---|
| q → 0 | L0 'norm' | Count of nonzeros | Exact sparsity (NP-hard) |
| q = 0.5 | L0.5 | Σ√|βⱼ| | Strong sparsity, non-convex |
| q = 1 | L1 / Lasso | Σ|βⱼ| | Sparsity + convexity boundary |
| q = 2 | L2 / Ridge | Σβⱼ² | Shrinkage, no sparsity |
| q → ∞ | L∞ | max|βⱼ| | Bounds maximum coefficient |
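To make the table concrete, here is a short sketch (using a made-up coefficient vector, not one from the text) that evaluates the penalty Σⱼ|βⱼ|^q for several q, along with the L0 count and the L∞ limit:

```python
import numpy as np

# Hypothetical coefficient vector: one large entry, a few small ones, one zero
beta = np.array([3.0, 0.5, 0.1, 0.0])

for q in [0.5, 1.0, 2.0]:
    penalty = np.sum(np.abs(beta) ** q)  # sum_j |beta_j|^q
    print(f"q = {q}: sum |beta_j|^q = {penalty:.3f}")

print("q -> 0   (L0):  ", np.count_nonzero(beta))   # count of nonzero coefficients
print("q -> inf (Linf):", np.max(np.abs(beta)))     # largest |beta_j|
```

Note how, as q shrinks, the small coefficients account for a larger share of the total penalty relative to the large one; this is exactly the pressure that pushes them toward exact zero.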
As q → 0, the Lq norm approaches the L0 'norm':
$$\|\boldsymbol{\beta}\|_0 = \#\{j : \beta_j \neq 0\}$$
This simply counts nonzero coefficients—the most direct measure of sparsity. The ideal sparse regression would minimize:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_0$$
Unfortunately, this is NP-hard in general—requiring exhaustive search over 2ᵖ possible support sets. Lasso (q = 1) is the tightest convex relaxation of this intractable problem.
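To see why the L0 problem is combinatorial, here is a minimal brute-force sketch (the synthetic data and the `best_subset` helper are illustrative assumptions, not from the text) that enumerates every possible support set; with p predictors there are 2ᵖ of them, which is why this only works for tiny problems:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Brute-force minimizer of ||y - X b||^2 + lam * ||b||_0.
    Enumerates all 2^p support sets -- feasible only for small p."""
    n, p = X.shape
    best_obj, best_beta = np.sum(y ** 2), np.zeros(p)  # start from the empty support
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            cols = list(S)
            b_S, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)  # OLS on this support
            resid = y - X[:, cols] @ b_S
            obj = np.sum(resid ** 2) + lam * k
            if obj < best_obj:
                best_obj = obj
                best_beta = np.zeros(p)
                best_beta[cols] = b_S
    return best_beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
beta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(best_subset(X, y, lam=1.0))   # already 2^8 = 256 candidate supports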
For q < 1, ||·||_q violates the triangle inequality, disqualifying it as a true norm. However, the term 'Lq norm' is commonly used even in this range. Mathematically, it's a quasi-norm, but this distinction rarely affects practical applications.
The unit ball Bq = {β : ||β||_q ≤ 1} changes shape dramatically with q:
In 2D (β₁, β₂): q = 2 gives a circular ball, q = 1 a diamond whose corners lie on the axes, larger q an increasingly box-like shape, and q < 1 a star-like region pinched in along the axes.
The transition from q > 1 to q < 1 is critical: the unit ball changes from convex to non-convex.
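As a quick visual check, this sketch (matplotlib assumed; not part of the original text) traces the boundary |β₁|^q + |β₂|^q = 1 for several q, so the convex-to-non-convex transition is visible directly:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
qs = [0.5, 1.0, 2.0, 4.0]

# Parameterize the boundary |b1|^q + |b2|^q = 1 in the first quadrant,
# then reflect it into the other three quadrants.
t = np.linspace(0.0, 1.0, 400)
for ax, q in zip(axes, qs):
    b1 = t
    b2 = (1.0 - t ** q) ** (1.0 / q)
    for s1 in (1, -1):
        for s2 in (1, -1):
            ax.plot(s1 * b1, s2 * b2, color="C0")
    ax.set_title(f"q = {q}")
    ax.set_aspect("equal")
    ax.set_xlabel(r"$\beta_1$")
axes[0].set_ylabel(r"$\beta_2$")
plt.tight_layout()
plt.show()
```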
Recall the geometric view of regularization: we seek where elliptical loss contours first touch the constraint region.
For q > 1: The constraint region is smooth everywhere. Contact typically occurs at a point where no coordinate is zero, so no coefficient is set exactly to zero unless the loss is specially aligned.
For q = 1: Corners exist at the coordinate axes. Contact at a corner sets one or more coordinates exactly to zero.
For q < 1: The constraint region is non-convex with even sharper 'cusps' at the axes. These cusps make axis-aligned contact even more likely.
The L1 norm (q = 1) is special because it sits exactly at the boundary between convex and non-convex: it is the smallest q for which the penalty is still convex, yet the largest q whose unit ball retains the axis-aligned corners that produce exact zeros.
This explains why Lasso is so popular: it achieves sparsity while maintaining computational tractability through convexity.
Smaller q → stronger sparsity induction but harder optimization. q = 1 (Lasso) is the 'sweet spot'—the smallest q that maintains convexity. Going below q = 1 gains sparsity at the cost of computational complexity and potential local minima.
The degree of sparsity in the Bridge solution depends on q: for q > 1 the penalty is differentiable at zero and coefficients are shrunk but almost never exactly zero; at q = 1 exact zeros appear; and as q decreases below 1, small coefficients are set to exactly zero ever more aggressively.
An important distinction lies in the asymptotic bias: under suitable conditions, estimators with q < 1 can recover the truly nonzero coefficients with negligible bias while setting the rest exactly to zero, something a convex penalty like the Lasso cannot do.
This is sometimes called the oracle property: the estimator asymptotically behaves as if we knew which coefficients were truly zero (and excluded them) and which were nonzero (and estimated them without penalty).
Lasso's L1 penalty applies the same shrinkage rate to all coefficients. Large coefficients are over-shrunk, creating bias. Sub-L1 penalties (q < 1) address this by applying less penalty to larger coefficients, but at the cost of non-convexity.
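A small numerical sketch of this effect (hypothetical values, with the scalar shrinkage found by brute-force grid search rather than a closed-form rule): for a single coefficient, minimize 0.5(b − z)² + λ|b|^q and compare how much the L1 and L0.5 penalties shrink small versus large values of z.

```python
import numpy as np

def scalar_shrinkage(z, q, lam):
    """Minimize 0.5*(b - z)^2 + lam*|b|^q over a fine grid of candidate b values."""
    grid = np.linspace(-abs(z) - 1.0, abs(z) + 1.0, 20001)  # symmetric grid containing 0
    obj = 0.5 * (grid - z) ** 2 + lam * np.abs(grid) ** q
    return grid[np.argmin(obj)]

lam = 1.0
for z in [0.5, 2.0, 5.0]:   # unpenalized coefficient values
    b_l1 = scalar_shrinkage(z, q=1.0, lam=lam)    # approximately soft-thresholding
    b_half = scalar_shrinkage(z, q=0.5, lam=lam)
    print(f"z = {z}: L1 -> {b_l1:.2f}, L0.5 -> {b_half:.2f}")
```

Both penalties send the small coefficient to zero, but the L0.5 penalty shrinks the larger coefficients noticeably less than the roughly constant λ subtraction of L1.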
The L0.5 penalty, Σⱼ√|βⱼ|, has received particular attention:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{|\beta_j|}$$
Advantages:

- Stronger sparsity than Lasso: small coefficients are driven exactly to zero more aggressively
- Less shrinkage of large coefficients, reducing the bias introduced by the L1 penalty
- Often achieves a comparable fit with fewer nonzero coefficients

Disadvantages:

- Non-convex objective with multiple local minima, so the solution depends on initialization
- No closed-form solution; optimization requires iterative schemes such as those described below
- The global-optimality guarantees of convex optimization no longer apply
```python
import numpy as np
from scipy.optimize import minimize


def bridge_penalty(beta, q, lambda_param):
    """Compute Lq penalty: lambda * sum(|beta|^q)."""
    return lambda_param * np.sum(np.abs(beta) ** q)


def bridge_objective(beta, X, y, q, lambda_param):
    """Bridge regression objective function."""
    residual = y - X @ beta
    loss = 0.5 * np.sum(residual ** 2)
    penalty = bridge_penalty(beta, q, lambda_param)
    return loss + penalty


def bridge_gradient(beta, X, y, q, lambda_param):
    """Gradient of bridge objective (for q >= 1)."""
    residual = y - X @ beta
    grad_loss = -X.T @ residual
    # Gradient of penalty (subgradient at 0)
    grad_penalty = lambda_param * q * np.sign(beta) * (np.abs(beta) ** (q - 1))
    grad_penalty[beta == 0] = 0  # Subgradient convention
    return grad_loss + grad_penalty


def fit_bridge_regression(X, y, q, lambda_param, method='L-BFGS-B'):
    """
    Fit Bridge Regression with Lq penalty.

    Parameters:
    -----------
    X : ndarray (n, p) - Design matrix
    y : ndarray (n,) - Response
    q : float - Norm parameter (q > 0)
    lambda_param : float - Regularization strength
    method : str - Optimization method

    Returns:
    --------
    beta : ndarray (p,) - Fitted coefficients
    """
    n, p = X.shape
    beta_init = np.zeros(p)

    if q >= 1:
        # Convex case: gradient-based optimization
        result = minimize(
            bridge_objective, beta_init,
            args=(X, y, q, lambda_param),
            method=method,
            jac=lambda b: bridge_gradient(b, X, y, q, lambda_param)
        )
    else:
        # Non-convex case: try multiple initializations
        best_obj = np.inf
        best_beta = beta_init
        for _ in range(10):
            init = np.random.randn(p) * 0.1
            result = minimize(
                bridge_objective, init,
                args=(X, y, q, lambda_param),
                method='Nelder-Mead'
            )
            if result.fun < best_obj:
                best_obj = result.fun
                best_beta = result.x
        return best_beta

    return result.x
```

Non-convexity requires special techniques:
- **Multiple random restarts:** run the optimization from many random initializations and keep the best solution
- **Warm starting from Lasso:** initialize with the Lasso solution, then optimize the non-convex objective
- **Iteratively Reweighted Lasso:** approximate the Lq penalty locally as a weighted L1 penalty and iterate until convergence
- **Majorization-Minimization (MM):** construct convex upper bounds of the objective and minimize them iteratively
- **Local Linear Approximation (LLA):** at each iteration, approximate |βⱼ|^q ≈ q|βⱼ⁽ᵏ⁾|^(q-1)|βⱼ| and solve the resulting weighted Lasso (see the sketch after this list)
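Below is a minimal sketch of the LLA / reweighted-Lasso idea. It assumes a simple proximal-gradient (ISTA) inner solver for the weighted Lasso and synthetic data; both are illustrative choices, not the text's prescribed implementation.

```python
import numpy as np

def weighted_lasso_ista(X, y, lam, w, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||y - X b||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)           # gradient of the least-squares loss
        z = b - grad / L                   # gradient step
        thresh = lam * w / L               # per-coordinate soft-threshold level
        b = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return b

def bridge_lla(X, y, q, lam, n_outer=5, eps=1e-6):
    """LLA for the Lq penalty: repeatedly solve a weighted Lasso whose weights
    q*|b_j|^(q-1) come from linearizing |b_j|^q at the current iterate."""
    p = X.shape[1]
    b = weighted_lasso_ista(X, y, lam, np.ones(p))   # warm start: plain Lasso
    for _ in range(n_outer):
        w = q * (np.abs(b) + eps) ** (q - 1)         # small weight for large b, huge for b near 0
        b = weighted_lasso_ista(X, y, lam, w)
    return b

# Illustrative use on synthetic data (hypothetical setup)
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(np.round(bridge_lla(X, y, q=0.5, lam=2.0), 2))
```

Each outer pass re-solves a weighted Lasso whose weights q|βⱼ⁽ᵏ⁾|^(q-1) are small for large coefficients (little extra shrinkage) and very large for coefficients near zero (keeping them exactly at zero).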
The choice of q depends on your priorities:
| Priority | Recommended q | Reasoning |
|---|---|---|
| Computational simplicity | q = 2 (Ridge) | Closed-form solution, no tuning beyond λ |
| Sparsity + convexity | q = 1 (Lasso) | Best sparsity while maintaining tractability |
| Lower bias for large coefficients | 0.5 ≤ q < 1 | Reduces Lasso's over-shrinkage at cost of convexity |
| Maximum sparsity (small p) | q → 0 | Approximates best subset; only for small problems |
| Grouped predictors + sparsity | Elastic Net | Combine q=1 and q=2 rather than single q |
Start with Lasso (q = 1). If bias is a concern and the problem scale permits, consider sub-L1 penalties with warm-start initialization from Lasso. For most applications, the bias-variance tradeoff of Lasso is favorable, and non-convex complications aren't worth the marginal gains.
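As a usage sketch, here is how the `fit_bridge_regression` function defined above could be compared across q on synthetic data. The data-generating setup is hypothetical, and the generic optimizers used above yield approximately rather than exactly zero coefficients for q = 1, so treat the output as qualitative.

```python
import numpy as np

# Hypothetical sparse setup: 3 true signals among 10 predictors
rng = np.random.default_rng(42)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for q in [2.0, 1.0, 0.5]:
    beta_hat = fit_bridge_regression(X, y, q=q, lambda_param=5.0)
    print(f"q = {q}:", np.round(beta_hat, 2))
```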
You now understand Bridge Regression and the Lq penalty family—the theoretical foundation connecting all shrinkage methods. Next, we'll explore non-convex penalties like SCAD and MCP that achieve near-unbiased estimation while maintaining computational tractability through clever penalty design.