Ridge Regression penalizes the squared L2 norm: ||β||₂² = Σⱼβⱼ². Lasso penalizes the L1 norm: ||β||₁ = Σⱼ|βⱼ|. But these are just two points on a continuous spectrum of Lq norm penalties:
$$\|\boldsymbol{\beta}\|_q = \left(\sum_{j=1}^{p} |\beta_j|^q\right)^{1/q}$$
Bridge Regression generalizes regularized regression to arbitrary q > 0:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$$
This seemingly simple generalization unlocks profound geometric insights and reveals why q = 1 (Lasso) is special for sparsity, while also opening the door to non-convex penalties (0 < q < 1) with even stronger sparsity-inducing properties.
By the end of this page, you will understand how the Lq norm parameter controls the geometry of the constraint region, why q = 1 is the convexity boundary, the theoretical properties of different q values, and when sub-L1 penalties (0 < q < 1) offer advantages despite their non-convexity.
For q > 0, the Lq (quasi-)norm of β ∈ ℝᵖ is:
$$\|\boldsymbol{\beta}\|_q = \left(\sum_{j=1}^{p} |\beta_j|^q\right)^{1/q}$$
Important special cases include:
| q Value | Name | Penalty Form | Key Property |
|---|---|---|---|
| q → 0 | L0 'norm' | Count of nonzeros | Exact sparsity (NP-hard) |
| q = 0.5 | L0.5 | Σ√|βⱼ| | Strong sparsity, non-convex |
| q = 1 | L1 / Lasso | Σ|βⱼ| | Sparsity + convexity boundary |
| q = 2 | L2 / Ridge | Σβⱼ² | Shrinkage, no sparsity |
| q → ∞ | L∞ | max|βⱼ| | Bounds maximum coefficient |
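To make the table concrete, here is a short sketch (using a made-up coefficient vector, not one from the text) that evaluates the penalty Σⱼ|βⱼ|^q for several q, along with the L0 count and the L∞ limit:

```python
import numpy as np

# Hypothetical coefficient vector: one large entry, a few small ones, one zero
beta = np.array([3.0, 0.5, 0.1, 0.0])

for q in [0.5, 1.0, 2.0]:
    penalty = np.sum(np.abs(beta) ** q)  # sum_j |beta_j|^q
    print(f"q = {q}: sum |beta_j|^q = {penalty:.3f}")

print("q -> 0   (L0):  ", np.count_nonzero(beta))   # count of nonzero coefficients
print("q -> inf (Linf):", np.max(np.abs(beta)))     # largest |beta_j|
```

Note how, as q shrinks, the small coefficients account for a larger share of the total penalty relative to the large one; this is exactly the pressure that pushes them toward exact zero.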
As q → 0, the Lq norm approaches the L0 'norm':
$$\|\boldsymbol{\beta}\|_0 = \#\{j : \beta_j \neq 0\}$$
This simply counts nonzero coefficients—the most direct measure of sparsity. The ideal sparse regression would minimize:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_0$$
Unfortunately, this is NP-hard in general—requiring exhaustive search over 2ᵖ possible support sets. Lasso (q = 1) is the tightest convex relaxation of this intractable problem.
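To see why the L0 problem is combinatorial, here is a minimal brute-force sketch (the synthetic data and the `best_subset` helper are illustrative assumptions, not from the text) that enumerates every possible support set; with p predictors there are 2ᵖ of them, which is why this only works for tiny problems:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Brute-force minimizer of ||y - X b||^2 + lam * ||b||_0.
    Enumerates all 2^p support sets -- feasible only for small p."""
    n, p = X.shape
    best_obj, best_beta = np.sum(y ** 2), np.zeros(p)  # start from the empty support
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            cols = list(S)
            b_S, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)  # OLS on this support
            resid = y - X[:, cols] @ b_S
            obj = np.sum(resid ** 2) + lam * k
            if obj < best_obj:
                best_obj = obj
                best_beta = np.zeros(p)
                best_beta[cols] = b_S
    return best_beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
beta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(best_subset(X, y, lam=1.0))   # already 2^8 = 256 candidate supports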
For q < 1, ||·||_q violates the triangle inequality, disqualifying it as a true norm. However, the term 'Lq norm' is commonly used even in this range. Mathematically, it's a quasi-norm, but this distinction rarely affects practical applications.
The unit ball Bq = {β : ||β||_q ≤ 1} changes shape dramatically with q:
In 2D (β₁, β₂): q = 2 gives a circular ball, q = 1 a diamond whose corners lie on the axes, larger q an increasingly box-like shape, and q < 1 a star-like region pinched in along the axes.
The transition from q > 1 to q < 1 is critical: the unit ball changes from convex to non-convex.
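As a quick visual check, this sketch (matplotlib assumed; not part of the original text) traces the boundary |β₁|^q + |β₂|^q = 1 for several q, so the convex-to-non-convex transition is visible directly:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
qs = [0.5, 1.0, 2.0, 4.0]

# Parameterize the boundary |b1|^q + |b2|^q = 1 in the first quadrant,
# then reflect it into the other three quadrants.
t = np.linspace(0.0, 1.0, 400)
for ax, q in zip(axes, qs):
    b1 = t
    b2 = (1.0 - t ** q) ** (1.0 / q)
    for s1 in (1, -1):
        for s2 in (1, -1):
            ax.plot(s1 * b1, s2 * b2, color="C0")
    ax.set_title(f"q = {q}")
    ax.set_aspect("equal")
    ax.set_xlabel(r"$\beta_1$")
axes[0].set_ylabel(r"$\beta_2$")
plt.tight_layout()
plt.show()
```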
Recall the geometric view of regularization: we seek where elliptical loss contours first touch the constraint region.
For q > 1: The constraint region is smooth everywhere. Contact typically occurs at a point where no coordinate is zero, so no coefficient is set exactly to zero unless the loss is specially aligned.
For q = 1: Corners exist at the coordinate axes. Contact at a corner sets one or more coordinates exactly to zero.
For q < 1: The constraint region is non-convex with even sharper 'cusps' at the axes. These cusps make axis-aligned contact even more likely.
The L1 norm (q = 1) is special because it sits exactly at the boundary between convex and non-convex: it is the smallest q for which the penalty is still convex, yet the largest q whose unit ball retains the axis-aligned corners that produce exact zeros.
This explains why Lasso is so popular: it achieves sparsity while maintaining computational tractability through convexity.
Smaller q → stronger sparsity induction but harder optimization. q = 1 (Lasso) is the 'sweet spot'—the smallest q that maintains convexity. Going below q = 1 gains sparsity at the cost of computational complexity and potential local minima.
The degree of sparsity in the Bridge solution depends on q: for q > 1 the penalty is differentiable at zero and coefficients are shrunk but almost never exactly zero; at q = 1 exact zeros appear; and as q decreases below 1, small coefficients are set to exactly zero ever more aggressively.
An important distinction lies in the asymptotic bias: under suitable conditions, estimators with q < 1 can recover the truly nonzero coefficients with negligible bias while setting the rest exactly to zero, something a convex penalty like the Lasso cannot do.
This is sometimes called the oracle property: the estimator asymptotically behaves as if we knew which coefficients were truly zero (and excluded them) and which were nonzero (and estimated them without penalty).
Lasso's L1 penalty applies the same shrinkage rate to all coefficients. Large coefficients are over-shrunk, creating bias. Sub-L1 penalties (q < 1) address this by applying less penalty to larger coefficients, but at the cost of non-convexity.
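A small numerical sketch of this effect (hypothetical values, with the scalar shrinkage found by brute-force grid search rather than a closed-form rule): for a single coefficient, minimize 0.5(b − z)² + λ|b|^q and compare how much the L1 and L0.5 penalties shrink small versus large values of z.

```python
import numpy as np

def scalar_shrinkage(z, q, lam):
    """Minimize 0.5*(b - z)^2 + lam*|b|^q over a fine grid of candidate b values."""
    grid = np.linspace(-abs(z) - 1.0, abs(z) + 1.0, 20001)  # symmetric grid containing 0
    obj = 0.5 * (grid - z) ** 2 + lam * np.abs(grid) ** q
    return grid[np.argmin(obj)]

lam = 1.0
for z in [0.5, 2.0, 5.0]:   # unpenalized coefficient values
    b_l1 = scalar_shrinkage(z, q=1.0, lam=lam)    # approximately soft-thresholding
    b_half = scalar_shrinkage(z, q=0.5, lam=lam)
    print(f"z = {z}: L1 -> {b_l1:.2f}, L0.5 -> {b_half:.2f}")
```

Both penalties send the small coefficient to zero, but the L0.5 penalty shrinks the larger coefficients noticeably less than the roughly constant λ subtraction of L1.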
The L0.5 penalty, Σⱼ√|βⱼ|, has received particular attention:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{|\beta_j|}$$
Advantages:

- Stronger sparsity than Lasso: small coefficients are driven exactly to zero more aggressively
- Less shrinkage of large coefficients, reducing the bias introduced by the L1 penalty
- Often achieves a comparable fit with fewer nonzero coefficients

Disadvantages:

- Non-convex objective with multiple local minima, so the solution depends on initialization
- No closed-form solution; optimization requires iterative schemes such as those described below
- The global-optimality guarantees of convex optimization no longer apply
```python
import numpy as np
from scipy.optimize import minimize


def bridge_penalty(beta, q, lambda_param):
    """Compute Lq penalty: lambda * sum(|beta|^q)."""
    return lambda_param * np.sum(np.abs(beta) ** q)


def bridge_objective(beta, X, y, q, lambda_param):
    """Bridge regression objective function."""
    residual = y - X @ beta
    loss = 0.5 * np.sum(residual ** 2)
    penalty = bridge_penalty(beta, q, lambda_param)
    return loss + penalty


def bridge_gradient(beta, X, y, q, lambda_param):
    """Gradient of bridge objective (for q >= 1)."""
    residual = y - X @ beta
    grad_loss = -X.T @ residual
    # Gradient of penalty (subgradient at 0)
    grad_penalty = lambda_param * q * np.sign(beta) * (np.abs(beta) ** (q - 1))
    grad_penalty[beta == 0] = 0  # Subgradient convention
    return grad_loss + grad_penalty


def fit_bridge_regression(X, y, q, lambda_param, method='L-BFGS-B'):
    """
    Fit Bridge Regression with Lq penalty.

    Parameters:
    -----------
    X : ndarray (n, p) - Design matrix
    y : ndarray (n,) - Response
    q : float - Norm parameter (q > 0)
    lambda_param : float - Regularization strength
    method : str - Optimization method

    Returns:
    --------
    beta : ndarray (p,) - Fitted coefficients
    """
    n, p = X.shape
    beta_init = np.zeros(p)

    if q >= 1:
        # Convex case: gradient-based optimization
        result = minimize(
            bridge_objective, beta_init,
            args=(X, y, q, lambda_param),
            method=method,
            jac=lambda b: bridge_gradient(b, X, y, q, lambda_param)
        )
    else:
        # Non-convex case: try multiple initializations
        best_obj = np.inf
        best_beta = beta_init
        for _ in range(10):
            init = np.random.randn(p) * 0.1
            result = minimize(
                bridge_objective, init,
                args=(X, y, q, lambda_param),
                method='Nelder-Mead'
            )
            if result.fun < best_obj:
                best_obj = result.fun
                best_beta = result.x
        return best_beta

    return result.x
```

Non-convexity requires special techniques:
- **Multiple random restarts:** run the optimization from many random initializations and keep the best solution
- **Warm starting from Lasso:** initialize with the Lasso solution, then optimize the non-convex objective
- **Iteratively Reweighted Lasso:** approximate the Lq penalty locally as a weighted L1 penalty and iterate until convergence
- **Majorization-Minimization (MM):** construct convex upper bounds of the objective and minimize them iteratively
- **Local Linear Approximation (LLA):** at each iteration, approximate |βⱼ|^q ≈ q|βⱼ⁽ᵏ⁾|^(q-1)|βⱼ| and solve the resulting weighted Lasso (see the sketch after this list)
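Below is a minimal sketch of the LLA / reweighted-Lasso idea. It assumes a simple proximal-gradient (ISTA) inner solver for the weighted Lasso and synthetic data; both are illustrative choices, not the text's prescribed implementation.

```python
import numpy as np

def weighted_lasso_ista(X, y, lam, w, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||y - X b||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)           # gradient of the least-squares loss
        z = b - grad / L                   # gradient step
        thresh = lam * w / L               # per-coordinate soft-threshold level
        b = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return b

def bridge_lla(X, y, q, lam, n_outer=5, eps=1e-6):
    """LLA for the Lq penalty: repeatedly solve a weighted Lasso whose weights
    q*|b_j|^(q-1) come from linearizing |b_j|^q at the current iterate."""
    p = X.shape[1]
    b = weighted_lasso_ista(X, y, lam, np.ones(p))   # warm start: plain Lasso
    for _ in range(n_outer):
        w = q * (np.abs(b) + eps) ** (q - 1)         # small weight for large b, huge for b near 0
        b = weighted_lasso_ista(X, y, lam, w)
    return b

# Illustrative use on synthetic data (hypothetical setup)
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(np.round(bridge_lla(X, y, q=0.5, lam=2.0), 2))
```

Each outer pass re-solves a weighted Lasso whose weights q|βⱼ⁽ᵏ⁾|^(q-1) are small for large coefficients (little extra shrinkage) and very large for coefficients near zero (keeping them exactly at zero).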
The choice of q depends on your priorities:
| Priority | Recommended q | Reasoning |
|---|---|---|
| Computational simplicity | q = 2 (Ridge) | Closed-form solution, no tuning beyond λ |
| Sparsity + convexity | q = 1 (Lasso) | Best sparsity while maintaining tractability |
| Lower bias for large coefficients | 0.5 ≤ q < 1 | Reduces Lasso's over-shrinkage at cost of convexity |
| Maximum sparsity (small p) | q → 0 | Approximates best subset; only for small problems |
| Grouped predictors + sparsity | Elastic Net | Combine q=1 and q=2 rather than single q |
Start with Lasso (q = 1). If bias is a concern and the problem scale permits, consider sub-L1 penalties with warm-start initialization from Lasso. For most applications, the bias-variance tradeoff of Lasso is favorable, and non-convex complications aren't worth the marginal gains.
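As a usage sketch, here is how the `fit_bridge_regression` function defined above could be compared across q on synthetic data. The data-generating setup is hypothetical, and the generic optimizers used above yield approximately rather than exactly zero coefficients for q = 1, so treat the output as qualitative.

```python
import numpy as np

# Hypothetical sparse setup: 3 true signals among 10 predictors
rng = np.random.default_rng(42)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

for q in [2.0, 1.0, 0.5]:
    beta_hat = fit_bridge_regression(X, y, q=q, lambda_param=5.0)
    print(f"q = {q}:", np.round(beta_hat, 2))
```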
You now understand Bridge Regression and the Lq penalty family—the theoretical foundation connecting all shrinkage methods. Next, we'll explore non-convex penalties like SCAD and MCP that achieve near-unbiased estimation while maintaining computational tractability through clever penalty design.