Lasso's L1 penalty achieves the remarkable feat of producing sparse solutions through a convex optimization problem. However, it comes with a fundamental limitation: shrinkage bias.
The L1 penalty applies the same shrinkage to every nonzero coefficient regardless of its magnitude: a coefficient of β = 10 is shrunk by the same absolute amount as β = 0.1. This uniform shrinkage systematically biases large, clearly important coefficients toward zero.
Non-convex penalties like SCAD and MCP address these issues by penalizing small coefficients heavily (for sparsity) while leaving large coefficients nearly unpenalized (to reduce bias). The price: non-convexity introduces potential local minima. The reward: near-oracle statistical properties.
By the end of this page, you will understand SCAD and MCP penalties, their theoretical advantages over Lasso, optimization algorithms for non-convex problems, and practical guidelines for when these advanced methods are worth the additional complexity.
Consider the soft-thresholding operator that defines the Lasso estimator for orthogonal X:
$$\hat{\beta}_j^{\text{Lasso}} = \text{sign}(\hat{\beta}_j^{\text{OLS}}) \cdot (|\hat{\beta}_j^{\text{OLS}}| - \lambda)_+$$
Every nonzero coefficient is shrunk by exactly λ. For a true coefficient β* = 10 with λ = 1, the estimate lands near 9, a 10% downward bias on a strong signal. For a true coefficient β* = 2, the estimate lands near 1, a 50% downward bias.
The penalty doesn't distinguish signal strength — weak signals face the same absolute shrinkage as strong signals.
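A quick numeric sketch of this effect, assuming an orthogonal design where the OLS estimate equals the true coefficient (the helper `soft_threshold` below is ours, for illustration only):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso update under an orthogonal design: shrink toward zero by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

lam = 1.0
for beta_true in (10.0, 2.0):
    # Illustrative simplification: treat the OLS estimate as exactly beta_true
    est = soft_threshold(beta_true, lam)
    rel_bias = 100 * (beta_true - est) / beta_true
    print(f"beta* = {beta_true:4.1f} -> Lasso estimate = {est:4.1f} ({rel_bias:.0f}% shrinkage)")
```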
Fan and Li (2001) proved that no penalty can simultaneously satisfy unbiasedness, sparsity, and convexity. Something must give. SCAD and MCP sacrifice convexity to achieve approximate unbiasedness and exact sparsity.
The SCAD penalty (Fan and Li, 2001) starts like Lasso but flattens for large coefficients:
$$p_{\lambda}(|\beta|) = \begin{cases} \lambda |\beta| & \text{if } |\beta| \leq \lambda \\ -\frac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta| \leq a\lambda \\ \frac{(a+1)\lambda^2}{2} & \text{if } |\beta| > a\lambda \end{cases}$$
where a > 2 is a shape parameter (typically a = 3.7 based on Bayesian arguments).
| |β| | Lasso Penalty | SCAD Penalty | SCAD Advantage |
|---|---|---|---|
| 0.5λ | 0.5λ² | 0.5λ² | Same (both induce sparsity) |
| 2λ | 2λ² | ~1.8λ² (reduced) | Less shrinkage |
| 5λ | 5λ² | 2.35λ² (constant) | No additional penalty |
| 10λ | 10λ² | 2.35λ² (constant) | Large coefficients unpenalized |
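A minimal sketch that evaluates the piecewise SCAD formula above and reproduces the table's values at a = 3.7 (`scad_penalty` is our illustrative helper, not a library function):

```python
import numpy as np

def scad_penalty(beta_abs, lam, a=3.7):
    """SCAD penalty value for one |beta|, following the piecewise definition."""
    if beta_abs <= lam:
        return lam * beta_abs
    elif beta_abs <= a * lam:
        return -(beta_abs**2 - 2 * a * lam * beta_abs + lam**2) / (2 * (a - 1))
    else:
        return (a + 1) * lam**2 / 2

lam = 1.0  # with lam = 1, the printed values are in units of lambda^2
for m in (0.5, 2, 5, 10):
    lasso_pen = lam * (m * lam)            # Lasso penalty: lambda * |beta|
    scad_pen = scad_penalty(m * lam, lam)  # SCAD penalty at the same |beta|
    print(f"|beta| = {m:>4}*lambda   Lasso: {lasso_pen:.2f}   SCAD: {scad_pen:.2f}")
```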
The SCAD penalty derivative reveals its behavior:
$$p'_{\lambda}(|\beta|) = \lambda \left\{ I(|\beta| \leq \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda} I(|\beta| > \lambda) \right\}$$
The derivative equals λ for small coefficients (matching Lasso), decreases linearly over (λ, aλ], and vanishes for |β| > aλ. This vanishing derivative for large coefficients is why SCAD produces nearly unbiased estimates for large signals.
Fan and Li arrived at a = 3.7 through a Bayesian risk argument. In practice, results are relatively insensitive to a in the range [3, 4]. Values too close to 2 make the penalty overly aggressive; very large values push SCAD back toward Lasso-like behavior.
The Minimax Concave Penalty (Zhang, 2010) offers similar properties to SCAD with a simpler form:
$$p_{\lambda,\gamma}(|\beta|) = \begin{cases} \lambda|\beta| - \frac{\beta^2}{2\gamma} & \text{if } |\beta| \leq \gamma\lambda \\ \frac{\gamma\lambda^2}{2} & \text{if } |\beta| > \gamma\lambda \end{cases}$$
where γ > 1 controls the transition point (similar to a in SCAD).
$$p'_{\lambda,\gamma}(|\beta|) = \left(\lambda - \frac{|\beta|}{\gamma}\right)_+$$
This is simply a linear function that decreases from λ at |β| = 0 to 0 at |β| = γλ, then stays at 0.
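A matching sketch for MCP, assuming the definitions above with γ = 3 (`mcp_penalty` and `mcp_derivative` here are our illustrative helpers):

```python
import numpy as np

def mcp_penalty(beta_abs, lam, gamma=3.0):
    """MCP penalty value: quadratically tapered up to gamma*lambda, then flat."""
    if beta_abs <= gamma * lam:
        return lam * beta_abs - beta_abs**2 / (2 * gamma)
    return gamma * lam**2 / 2

def mcp_derivative(beta_abs, lam, gamma=3.0):
    """MCP penalty derivative: decreases linearly from lambda to zero at gamma*lambda."""
    return max(lam - beta_abs / gamma, 0.0)

lam, gamma = 1.0, 3.0
for b in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(f"|beta| = {b:.1f}: penalty = {mcp_penalty(b, lam, gamma):.2f}, "
          f"derivative = {mcp_derivative(b, lam, gamma):.2f}")
```

The printed derivative drops to zero exactly at |β| = γλ = 3, after which the penalty stays constant.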
Zhang showed that MCP is minimax concave: among all penalties that satisfy certain regularity conditions and achieve unbiasedness beyond the threshold γλ, MCP has the least concavity. This matters because concavity is precisely what creates local minima in the optimization:
MCP is as close to convex as possible while still achieving the desired sparsity and unbiasedness properties.
The most elegant optimization approach, the Local Linear Approximation (LLA), approximates the non-convex penalty locally as a weighted L1 penalty:
$$p_{\lambda}(|\beta_j|) \approx p_{\lambda}(|\beta_j^{(k)}|) + p'_{\lambda}(|\beta_j^{(k)}|)(|\beta_j| - |\beta_j^{(k)}|)$$
Dropping constants, this gives a weighted Lasso subproblem:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \sum_{j=1}^{p} w_j^{(k)} |\beta_j|$$
where w_j^(k) = p'_λ(|β_j^(k)|) are weights derived from the current solution.
```python
import numpy as np
from sklearn.linear_model import Lasso


def scad_derivative(beta_abs, lambda_param, a=3.7):
    """Compute SCAD penalty derivative."""
    if beta_abs <= lambda_param:
        return lambda_param
    elif beta_abs <= a * lambda_param:
        return (a * lambda_param - beta_abs) / (a - 1)
    else:
        return 0.0


def mcp_derivative(beta_abs, lambda_param, gamma=3.0):
    """Compute MCP penalty derivative."""
    return max(lambda_param - beta_abs / gamma, 0.0)


def fit_ncvx_lla(X, y, lambda_param, penalty='scad', a=3.7, gamma=3.0,
                 max_iter=100, tol=1e-6):
    """
    Fit non-convex penalized regression via Local Linear Approximation.

    Parameters
    ----------
    X : ndarray (n, p) - Design matrix
    y : ndarray (n,) - Response
    lambda_param : float - Regularization strength
    penalty : str - 'scad' or 'mcp'
    a : float - SCAD parameter
    gamma : float - MCP parameter
    max_iter : int - Maximum LLA iterations
    tol : float - Convergence tolerance

    Returns
    -------
    beta : ndarray (p,) - Fitted coefficients
    """
    n, p = X.shape

    # Initialize with the Lasso solution (warm start).
    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha*||b||_1,
    # which matches our objective with alpha = lambda_param.
    lasso = Lasso(alpha=lambda_param, fit_intercept=False)
    lasso.fit(X, y)
    beta = lasso.coef_.copy()

    for iteration in range(max_iter):
        beta_old = beta.copy()

        # Adaptive weights w_j = p'_lambda(|beta_j|) from the current solution
        weights = np.zeros(p)
        for j in range(p):
            beta_abs = np.abs(beta[j])
            if penalty == 'scad':
                weights[j] = scad_derivative(beta_abs, lambda_param, a)
            else:  # mcp
                weights[j] = mcp_derivative(beta_abs, lambda_param, gamma)

        # Solve the weighted Lasso subproblem by coordinate descent
        for j in range(p):
            # Partial residual excluding predictor j
            r_j = y - X @ beta + X[:, j] * beta[j]
            # Soft-thresholding with the adaptive weight as threshold
            # (a zero weight means no shrinkage: a plain least-squares update)
            z_j = X[:, j] @ r_j / n
            norm_sq = np.sum(X[:, j] ** 2) / n
            beta[j] = np.sign(z_j) * max(np.abs(z_j) - weights[j], 0.0) / norm_sq

        # Check convergence
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"LLA converged at iteration {iteration}")
            break

    return beta
```

LLA for SCAD/MCP has favorable convergence properties: because the penalty is concave in |β|, the linearization majorizes it, so each iteration solves a convex weighted Lasso problem and never increases the objective.
The one-step estimator property is remarkable: initialized at the Lasso solution, one LLA iteration often suffices for near-optimal performance.
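A hedged usage sketch of the `fit_ncvx_lla` function above on synthetic data, comparing the Lasso warm start with a single LLA pass; the data-generating setup, seed, and λ value are our own choices, and the exact numbers will vary:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic sparse problem: 5 strong signals among 50 predictors
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [5.0, -4.0, 3.0, -3.0, 2.0]
y = X @ beta_true + rng.standard_normal(n)

lam = 0.5

# Plain Lasso: strong signals are shrunk toward zero
lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

# One LLA step with the SCAD penalty, warm-started at the Lasso solution
beta_scad = fit_ncvx_lla(X, y, lambda_param=lam, penalty='scad', max_iter=1)

print("True nonzeros:    ", np.round(beta_true[:5], 2))
print("Lasso estimates:  ", np.round(lasso.coef_[:5], 2))
print("SCAD-LLA (1 step):", np.round(beta_scad[:5], 2))
```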
In practice, SCAD and MCP perform similarly. The main considerations are convenience rather than accuracy: MCP's two-piece form is simpler to implement and analyze and is the gentlest possible departure from convexity, while SCAD is the older default with the well-established a = 3.7 convention.
An estimator has the oracle property if it asymptotically:

1. Selects the true model: $P(\{j : \hat{\beta}_j \neq 0\} = S) \to 1$
2. Estimates the nonzero coefficients as efficiently as if the support were known: $\sqrt{n}(\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}^*_S) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma}^*)$
where S is the true support and Σ* is the asymptotic covariance of the oracle OLS estimator that knows S.
In other words, the estimator behaves as if we knew which coefficients were truly zero — the oracle knowledge that motivates the name.
Standard Lasso fails both conditions in general. It doesn't consistently select the correct support (requires irrepresentability conditions), and its nonzero coefficient estimates are biased. SCAD and MCP achieve oracle properties under much weaker conditions.
SCAD and MCP achieve oracle properties under mild regularity conditions: chiefly, the nonzero coefficients must be large enough to clear the penalty's flattening threshold (a minimum signal strength condition), and λ must shrink at an appropriate rate.
Critically, they do not require the irrepresentability condition needed for Lasso's model selection consistency — a substantially weaker requirement.
Oracle properties matter practically when you care about the coefficients themselves, not just predictions: when effect sizes will be interpreted, compared, or used for downstream inference, the near-unbiasedness of SCAD and MCP pays off.
You now understand non-convex penalties—SCAD and MCP—and their theoretical and practical advantages over Lasso. Next, we'll explore Structured Regularization, which incorporates domain knowledge through penalty design for graphs, hierarchies, and other structural constraints.