Lasso's L1 penalty achieves the remarkable feat of producing sparse solutions through a convex optimization problem. However, it comes with a fundamental limitation: shrinkage bias.
The L1 penalty applies the same shrinkage to every nonzero coefficient regardless of its magnitude: a coefficient of β = 10 is shrunk by the same absolute amount as β = 0.1. This uniform shrinkage systematically biases large, clearly important coefficients toward zero.
Non-convex penalties like SCAD and MCP address these issues by penalizing small coefficients heavily (for sparsity) while leaving large coefficients nearly unpenalized (to reduce bias). The price: non-convexity introduces potential local minima. The reward: near-oracle statistical properties.
By the end of this page, you will understand SCAD and MCP penalties, their theoretical advantages over Lasso, optimization algorithms for non-convex problems, and practical guidelines for when these advanced methods are worth the additional complexity.
Consider the soft-thresholding operator that defines the Lasso estimator for orthogonal X:
$$\hat{\beta}_j^{\text{Lasso}} = \text{sign}(\hat{\beta}_j^{\text{OLS}}) \cdot (|\hat{\beta}_j^{\text{OLS}}| - \lambda)_+$$
Every nonzero coefficient is shrunk by exactly λ. For a true coefficient β* = 10 with λ = 1, the estimate lands near 9, a 10% downward bias on a strong signal. For a true coefficient β* = 2, the estimate lands near 1, a 50% downward bias.
The penalty doesn't distinguish signal strength — weak signals face the same absolute shrinkage as strong signals.
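A quick numeric sketch of this effect, assuming an orthogonal design where the OLS estimate equals the true coefficient (the helper `soft_threshold` below is ours, for illustration only):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso update under an orthogonal design: shrink toward zero by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

lam = 1.0
for beta_true in (10.0, 2.0):
    # Illustrative simplification: treat the OLS estimate as exactly beta_true
    est = soft_threshold(beta_true, lam)
    rel_bias = 100 * (beta_true - est) / beta_true
    print(f"beta* = {beta_true:4.1f} -> Lasso estimate = {est:4.1f} ({rel_bias:.0f}% shrinkage)")
```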
Fan and Li (2001) proved that no penalty can simultaneously satisfy unbiasedness, sparsity, and convexity. Something must give. SCAD and MCP sacrifice convexity to achieve approximate unbiasedness and exact sparsity.
The SCAD penalty (Fan and Li, 2001) starts like Lasso but flattens for large coefficients:
$$p_{\lambda}(|\beta|) = \begin{cases} \lambda |\beta| & \text{if } |\beta| \leq \lambda \\ -\frac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta| \leq a\lambda \\ \frac{(a+1)\lambda^2}{2} & \text{if } |\beta| > a\lambda \end{cases}$$
where a > 2 is a shape parameter (typically a = 3.7 based on Bayesian arguments).
| |β| | Lasso Penalty | SCAD Penalty | SCAD Advantage |
|---|---|---|---|
| 0.5λ | 0.5λ² | 0.5λ² | Same (both induce sparsity) |
| 2λ | 2λ² | ~1.8λ² (reduced) | Less shrinkage |
| 5λ | 5λ² | 2.35λ² (constant) | No additional penalty |
| 10λ | 10λ² | 2.35λ² (constant) | Large coefficients unpenalized |
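A minimal sketch that evaluates the piecewise SCAD formula above and reproduces the table's values at a = 3.7 (`scad_penalty` is our illustrative helper, not a library function):

```python
import numpy as np

def scad_penalty(beta_abs, lam, a=3.7):
    """SCAD penalty value for one |beta|, following the piecewise definition."""
    if beta_abs <= lam:
        return lam * beta_abs
    elif beta_abs <= a * lam:
        return -(beta_abs**2 - 2 * a * lam * beta_abs + lam**2) / (2 * (a - 1))
    else:
        return (a + 1) * lam**2 / 2

lam = 1.0  # with lam = 1, the printed values are in units of lambda^2
for m in (0.5, 2, 5, 10):
    lasso_pen = lam * (m * lam)            # Lasso penalty: lambda * |beta|
    scad_pen = scad_penalty(m * lam, lam)  # SCAD penalty at the same |beta|
    print(f"|beta| = {m:>4}*lambda   Lasso: {lasso_pen:.2f}   SCAD: {scad_pen:.2f}")
```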
The SCAD penalty derivative reveals its behavior:
$$p'_{\lambda}(|\beta|) = \lambda \left\{ I(|\beta| \leq \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda} I(|\beta| > \lambda) \right\}$$
The derivative equals λ for small coefficients (matching Lasso), decreases linearly over (λ, aλ], and vanishes for |β| > aλ. This vanishing derivative for large coefficients is why SCAD produces nearly unbiased estimates for large signals.
Fan and Li arrived at a = 3.7 through a Bayesian risk argument. In practice, results are relatively insensitive to a in the range [3, 4]. Values too close to 2 make the penalty overly aggressive; very large values push SCAD back toward Lasso-like behavior.
The Minimax Concave Penalty (Zhang, 2010) offers similar properties to SCAD with a simpler form:
$$p_{\lambda,\gamma}(|\beta|) = \begin{cases} \lambda|\beta| - \frac{\beta^2}{2\gamma} & \text{if } |\beta| \leq \gamma\lambda \\ \frac{\gamma\lambda^2}{2} & \text{if } |\beta| > \gamma\lambda \end{cases}$$
where γ > 1 controls the transition point (similar to a in SCAD).
$$p'_{\lambda,\gamma}(|\beta|) = \left(\lambda - \frac{|\beta|}{\gamma}\right)_+$$
This is simply a linear function that decreases from λ at |β| = 0 to 0 at |β| = γλ, then stays at 0.
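A matching sketch for MCP, assuming the definitions above with γ = 3 (`mcp_penalty` and `mcp_derivative` here are our illustrative helpers):

```python
import numpy as np

def mcp_penalty(beta_abs, lam, gamma=3.0):
    """MCP penalty value: quadratically tapered up to gamma*lambda, then flat."""
    if beta_abs <= gamma * lam:
        return lam * beta_abs - beta_abs**2 / (2 * gamma)
    return gamma * lam**2 / 2

def mcp_derivative(beta_abs, lam, gamma=3.0):
    """MCP penalty derivative: decreases linearly from lambda to zero at gamma*lambda."""
    return max(lam - beta_abs / gamma, 0.0)

lam, gamma = 1.0, 3.0
for b in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(f"|beta| = {b:.1f}: penalty = {mcp_penalty(b, lam, gamma):.2f}, "
          f"derivative = {mcp_derivative(b, lam, gamma):.2f}")
```

The printed derivative drops to zero exactly at |β| = γλ = 3, after which the penalty stays constant.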
Zhang showed that MCP is minimax concave: among all penalties that satisfy certain regularity conditions and achieve unbiasedness beyond the threshold γλ, MCP has the least concavity. This matters because concavity is precisely what creates local minima in the optimization:
MCP is as close to convex as possible while still achieving the desired sparsity and unbiasedness properties.
The most elegant optimization approach, the Local Linear Approximation (LLA), approximates the non-convex penalty locally as a weighted L1 penalty:
$$p_{\lambda}(|\beta_j|) \approx p_{\lambda}(|\beta_j^{(k)}|) + p'_{\lambda}(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|)$$
Dropping constants, this gives a weighted Lasso subproblem:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \sum_{j=1}^{p} w_j^{(k)} |\beta_j|$$
where w_j^(k) = p'_λ(|β_j^(k)|) are weights derived from the current solution.
```python
import numpy as np
from sklearn.linear_model import Lasso


def scad_derivative(beta_abs, lambda_param, a=3.7):
    """Compute SCAD penalty derivative."""
    if beta_abs <= lambda_param:
        return lambda_param
    elif beta_abs <= a * lambda_param:
        return (a * lambda_param - beta_abs) / (a - 1)
    else:
        return 0.0


def mcp_derivative(beta_abs, lambda_param, gamma=3.0):
    """Compute MCP penalty derivative."""
    return max(lambda_param - beta_abs / gamma, 0.0)


def fit_ncvx_lla(X, y, lambda_param, penalty='scad', a=3.7, gamma=3.0,
                 max_iter=100, tol=1e-6):
    """
    Fit non-convex penalized regression via Local Linear Approximation.

    Parameters
    ----------
    X : ndarray (n, p) - Design matrix
    y : ndarray (n,) - Response
    lambda_param : float - Regularization strength
    penalty : str - 'scad' or 'mcp'
    a : float - SCAD parameter
    gamma : float - MCP parameter
    max_iter : int - Maximum LLA iterations
    tol : float - Convergence tolerance

    Returns
    -------
    beta : ndarray (p,) - Fitted coefficients
    """
    n, p = X.shape

    # Initialize with the Lasso solution (warm start).
    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha*||b||_1,
    # which matches our objective with alpha = lambda_param.
    lasso = Lasso(alpha=lambda_param, fit_intercept=False)
    lasso.fit(X, y)
    beta = lasso.coef_.copy()

    for iteration in range(max_iter):
        beta_old = beta.copy()

        # Adaptive weights w_j = p'_lambda(|beta_j|) from the current solution
        weights = np.zeros(p)
        for j in range(p):
            beta_abs = np.abs(beta[j])
            if penalty == 'scad':
                weights[j] = scad_derivative(beta_abs, lambda_param, a)
            else:  # mcp
                weights[j] = mcp_derivative(beta_abs, lambda_param, gamma)

        # Solve the weighted Lasso subproblem by coordinate descent
        for j in range(p):
            # Partial residual excluding predictor j
            r_j = y - X @ beta + X[:, j] * beta[j]
            # Soft-thresholding with the adaptive weight as threshold
            # (a zero weight means no shrinkage: a plain least-squares update)
            z_j = X[:, j] @ r_j / n
            norm_sq = np.sum(X[:, j] ** 2) / n
            beta[j] = np.sign(z_j) * max(np.abs(z_j) - weights[j], 0.0) / norm_sq

        # Check convergence
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"LLA converged at iteration {iteration}")
            break

    return beta
```

LLA for SCAD/MCP has favorable convergence properties: because the penalty is concave in |β|, the linearization majorizes it, so each iteration solves a convex weighted Lasso problem and never increases the objective.
The one-step estimator property is remarkable: initialized at the Lasso solution, one LLA iteration often suffices for near-optimal performance.
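A hedged usage sketch of the `fit_ncvx_lla` function above on synthetic data, comparing the Lasso warm start with a single LLA pass; the data-generating setup, seed, and λ value are our own choices, and the exact numbers will vary:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic sparse problem: 5 strong signals among 50 predictors
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [5.0, -4.0, 3.0, -3.0, 2.0]
y = X @ beta_true + rng.standard_normal(n)

lam = 0.5

# Plain Lasso: strong signals are shrunk toward zero
lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

# One LLA step with the SCAD penalty, warm-started at the Lasso solution
beta_scad = fit_ncvx_lla(X, y, lambda_param=lam, penalty='scad', max_iter=1)

print("True nonzeros:    ", np.round(beta_true[:5], 2))
print("Lasso estimates:  ", np.round(lasso.coef_[:5], 2))
print("SCAD-LLA (1 step):", np.round(beta_scad[:5], 2))
```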
In practice, SCAD and MCP perform similarly. The main considerations are convenience rather than accuracy: MCP's two-piece form is simpler to implement and analyze and is the gentlest possible departure from convexity, while SCAD is the older default with the well-established a = 3.7 convention.
An estimator has the oracle property if it asymptotically:

1. Selects the true model: $P(\{j : \hat{\beta}_j \neq 0\} = S) \to 1$
2. Estimates the nonzero coefficients as efficiently as if the support were known: $\sqrt{n}(\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}^*_S) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma}^*)$
where S is the true support and Σ* is the asymptotic covariance of the oracle OLS estimator that knows S.
In other words, the estimator behaves as if we knew which coefficients were truly zero — the oracle knowledge that motivates the name.
Standard Lasso fails both conditions in general. It doesn't consistently select the correct support (requires irrepresentability conditions), and its nonzero coefficient estimates are biased. SCAD and MCP achieve oracle properties under much weaker conditions.
SCAD and MCP achieve oracle properties under mild regularity conditions: chiefly, the nonzero coefficients must be large enough to clear the penalty's flattening threshold (a minimum signal strength condition), and λ must shrink at an appropriate rate.
Critically, they do not require the irrepresentability condition needed for Lasso's model selection consistency — a substantially weaker requirement.
Oracle properties matter practically when you care about the coefficients themselves, not just predictions: when effect sizes will be interpreted, compared, or used for downstream inference, the near-unbiasedness of SCAD and MCP pays off.
You now understand non-convex penalties—SCAD and MCP—and their theoretical and practical advantages over Lasso. Next, we'll explore Structured Regularization, which incorporates domain knowledge through penalty design for graphs, hierarchies, and other structural constraints.