Standard Lasso regression enforces sparsity at the individual coefficient level—each feature is independently shrunk toward zero, and some coefficients become exactly zero. While powerful, this element-wise approach ignores an important reality: features often belong to natural groups.
Consider these scenarios:

- A categorical variable with k levels is one-hot encoded into k − 1 dummy variables that only make sense together.
- In gene expression studies, genes act within functional pathways rather than in isolation.
- In time series regression, several lags of the same variable form a natural block that should enter or leave the model as a unit.
Group Lasso extends Lasso to handle these structured sparsity patterns, enabling group-level variable selection while maintaining the computational tractability that makes Lasso practical.
By the end of this page, you will understand the mathematical formulation of Group Lasso, its geometric interpretation, optimization algorithms, and practical applications. You'll be equipped to recognize when group structure exists in your data and how to exploit it for better model performance.
Let us partition the coefficient vector β ∈ ℝᵖ into G non-overlapping groups:
$$\boldsymbol{\beta} = (\boldsymbol{\beta}^{(1)}, \boldsymbol{\beta}^{(2)}, \ldots, \boldsymbol{\beta}^{(G)})$$
where β⁽ᵍ⁾ ∈ ℝᵖᵍ contains the pᵍ coefficients for group g, and Σᵍ pᵍ = p.
The Group Lasso objective function is:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g} \|\boldsymbol{\beta}^{(g)}\|_2$$
Let's dissect each component:

- The first term, (1/2n)||y − Xβ||₂², is the usual squared-error loss, averaged over the n observations.
- λ ≥ 0 controls the overall strength of regularization, exactly as in Lasso and Ridge.
- ||β⁽ᵍ⁾||₂ is the Euclidean norm of group g's coefficients; it is zero only when the entire group is zero.
- √pᵍ is a group-size weight that keeps groups of different sizes on a comparable penalty scale (more on this below).
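To make these pieces concrete, here is a minimal NumPy sketch (the function name and toy data are illustrative, not part of any library) that evaluates this objective for a given coefficient vector and group partition:

```python
import numpy as np

def group_lasso_objective(X, y, beta, groups, lam):
    """Evaluate (1/2n)||y - X beta||_2^2 + lam * sum_g sqrt(p_g) * ||beta_g||_2."""
    n = X.shape[0]
    loss = 0.5 / n * np.sum((y - X @ beta) ** 2)                # smooth squared-error term
    penalty = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g])     # sqrt(p_g)-weighted group norms
                  for g in groups)
    return loss + lam * penalty

# Example: p = 5 features split into two groups of sizes 2 and 3
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])   # the second group is entirely zero
y = X @ beta + 0.1 * rng.normal(size=10)
groups = [[0, 1], [2, 3, 4]]
print(group_lasso_objective(X, y, beta, groups, lam=0.1))
```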
The Group Lasso penalty is often written as an L2,1 mixed norm:
$$\Omega(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_{2,1} = \sum_{g=1}^{G} \|\boldsymbol{\beta}^{(g)}\|_2$$
This notation emphasizes the two-level structure:

- L1 across groups: the group norms ||β⁽ᵍ⁾||₂ are summed, so entire groups can be shrunk to exactly zero (group sparsity).
- L2 within groups: coefficients inside a group are penalized jointly through their Euclidean norm, so there is no sparsity within a selected group; if a group is selected, its coefficients are generically all nonzero.
Group Lasso enforces an all-or-nothing selection at the group level: either ||β⁽ᵍ⁾||₂ = 0 (the entire group is excluded) or ||β⁽ᵍ⁾||₂ > 0 (the group is included and its coefficients are generically all nonzero). This contrasts with standard Lasso, which can select arbitrary subsets of features.
The group size weighting √pᵍ serves a crucial normalization purpose. Without it, every group pays a penalty only on its L2 norm, which grows like √pᵍ rather than pᵍ, so the penalty per coefficient shrinks as the group gets larger, biasing selection toward large groups.

To see this, consider two groups whose coefficients all have magnitude 1: Group 1 with a single coefficient (||β⁽¹⁾||₂ = 1) and Group 2 with four coefficients (||β⁽²⁾||₂ = √4 = 2).

Without weighting, Group 2 incurs only twice the penalty of Group 1 while contributing four times as many coefficients, so each of its coefficients is penalized half as much. The √pᵍ weighting equalizes this: the weighted penalties become √1 · 1 = 1 and √4 · 2 = 4, exactly one unit of penalty per coefficient in either group.

This ensures each group is penalized in proportion to the number of parameters it contributes, so selection reflects signal strength rather than group size. (Equivalently, under pure noise the aggregate correlation ||(X⁽ᵍ⁾)ᵀr||₂ grows like √pᵍ, and the weight keeps the selection threshold comparable across group sizes.)
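A quick numeric check of this argument (the group sizes 1 and 4 are arbitrary illustrative choices):

```python
import numpy as np

# Two groups whose coefficients all have magnitude 1
beta_g1 = np.ones(1)   # p_1 = 1, ||beta_1||_2 = 1
beta_g2 = np.ones(4)   # p_2 = 4, ||beta_2||_2 = 2

for name, b in [("group 1", beta_g1), ("group 2", beta_g2)]:
    p_g = len(b)
    unweighted = np.linalg.norm(b)                   # raw L2 norm of the group
    weighted = np.sqrt(p_g) * np.linalg.norm(b)      # sqrt(p_g)-weighted penalty
    print(name, "unweighted:", unweighted,
          "weighted:", weighted,
          "weighted per coefficient:", weighted / p_g)

# Unweighted, group 2 pays only 2x the penalty for 4x the coefficients;
# weighted, both groups pay exactly one unit of penalty per coefficient.
```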
Just as Lasso's L1 penalty corresponds to a diamond-shaped constraint region and Ridge's L2 penalty corresponds to a spherical constraint region, Group Lasso's penalty defines a distinctive geometric shape.
Consider the constraint form of Group Lasso:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \sum_{g=1}^{G} \|\boldsymbol{\beta}^{(g)}\|_2 \leq t$$
For a simple case with two groups of two variables each (β = (β₁, β₂, β₃, β₄)), the constraint becomes:
$$\sqrt{\beta_1^2 + \beta_2^2} + \sqrt{\beta_3^2 + \beta_4^2} \leq t$$
The Group Lasso constraint region has a distinctive shape: it is 'round' within groups (L2 creates circular/spherical cross-sections) but has 'corners' between groups (L1 structure). These corners lie along the subspaces where entire groups are zero, enabling group sparsity when the loss contour touches these corners.
Imagine the 4D constraint region projected into understandable components:

- Slice within a group, say the (β₁, β₂) plane with β₃ = β₄ = 0: the cross-section is a disk, since only the L2 norm of group 1 is constrained.
- Slice across groups, say the (β₁, β₃) plane with β₂ = β₄ = 0: the constraint reduces to |β₁| + |β₃| ≤ t, the familiar Lasso diamond with corners on the axes.
This creates a shape that looks like a hypercylinder with corners—smooth and round where you slice through a single group, but with sharp edges where groups meet.
The geometric intuition for sparsity follows the same logic as standard Lasso: the solution lies where the elliptical contours of the squared-error loss first touch the constraint region, and the non-smooth points of that region, which here are exactly the places where an entire group's norm is zero, are disproportionately likely contact points.

The probability of landing exactly on such a corner (inducing group sparsity) increases as regularization strengthens: a larger λ (equivalently, a smaller budget t) shrinks the constraint region and makes contact at a point where one or more whole groups are zero more likely.
| Method | Penalty | Constraint Shape | Sparsity Type |
|---|---|---|---|
| Ridge (L2) | ‖β‖₂² | Hypersphere (smooth everywhere) | No sparsity |
| Lasso (L1) | ‖β‖₁ | Cross-polytope / diamond (corners on the axes) | Element-wise sparsity |
| Group Lasso | Σᵍ ‖β⁽ᵍ⁾‖₂ | Hybrid: round within groups, corners between | Group-wise sparsity |
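As a small illustration (plain NumPy, illustrative values), the snippet below evaluates all three penalties on a coefficient vector whose second group is exactly zero; only the Group Lasso penalty is built from group norms and therefore "sees" that structure:

```python
import numpy as np

beta = np.array([1.5, -0.5, 0.0, 0.0])   # group 1 = (beta1, beta2), group 2 = (beta3, beta4)
groups = [[0, 1], [2, 3]]

ridge = np.sum(beta ** 2)                              # ||beta||_2^2: smooth, no sparsity
lasso = np.sum(np.abs(beta))                           # ||beta||_1: element-wise corners
group = sum(np.linalg.norm(beta[g]) for g in groups)   # sum_g ||beta^(g)||_2: group-level corners
print("ridge:", ridge, "lasso:", lasso, "group lasso:", group)
```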
Like the Lasso, Group Lasso is non-differentiable when any group's coefficient vector equals zero. The L2 norm ||β⁽ᵍ⁾||₂ has no gradient at β⁽ᵍ⁾ = 0 because the function has a 'kink' at the origin.
To characterize optimality, we use subdifferential calculus. The subdifferential of the Group Lasso penalty with respect to group g is:
$$\partial \|\boldsymbol{\beta}^{(g)}\|_2 = \begin{cases} \left\{ \dfrac{\boldsymbol{\beta}^{(g)}}{\|\boldsymbol{\beta}^{(g)}\|_2} \right\} & \text{if } \boldsymbol{\beta}^{(g)} \neq \mathbf{0} \\ \left\{ \mathbf{v} : \|\mathbf{v}\|_2 \leq 1 \right\} & \text{if } \boldsymbol{\beta}^{(g)} = \mathbf{0} \end{cases}$$
Define the residual vector r = y - Xβ and let X⁽ᵍ⁾ denote the columns of X corresponding to group g. The KKT optimality conditions state that β* is optimal if and only if for each group g:
Case 1: Active group (β⁽ᵍ⁾ ≠ 0) $$\frac{1}{n}(\mathbf{X}^{(g)})^T \mathbf{r} = \lambda \sqrt{p_g} \, \frac{\boldsymbol{\beta}^{(g)}}{\|\boldsymbol{\beta}^{(g)}\|_2}$$
Case 2: Inactive group (β⁽ᵍ⁾ = 0) $$\left\| \frac{1}{n}(\mathbf{X}^{(g)})^T \mathbf{r} \right\|_2 \leq \lambda \sqrt{p_g}$$
These conditions have intuitive interpretations:
A group enters the model (becomes nonzero) only when its aggregate correlation with the residual exceeds λ√pᵍ. This is fundamentally different from Lasso, where each coefficient has its own threshold. In Group Lasso, weak individual correlations can combine to exceed the group threshold.
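A minimal sketch of checking these conditions numerically, assuming a candidate solution beta, the same 1/(2n) loss scaling used above, and hypothetical function and tolerance names:

```python
import numpy as np

def check_group_kkt(X, y, beta, groups, lam, tol=1e-6):
    """Report, for each group, whether the Group Lasso KKT condition holds at beta."""
    n = X.shape[0]
    r = y - X @ beta                                   # residual at the candidate solution
    for g in groups:
        corr = X[:, g].T @ r / n                       # (1/n) X_g^T r: aggregate correlation of group g
        threshold = lam * np.sqrt(len(g))
        if np.linalg.norm(beta[g]) > tol:              # active group: equality with the unit direction
            target = threshold * beta[g] / np.linalg.norm(beta[g])
            ok = np.allclose(corr, target, atol=1e-4)
        else:                                          # inactive group: correlation norm below the threshold
            ok = np.linalg.norm(corr) <= threshold + 1e-4
        print(g, "KKT satisfied:", ok)
```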
The most popular algorithm for Group Lasso is block coordinate descent, which cycles through groups and optimizes each group's coefficients while holding others fixed.
For group g, the subproblem is:
$$\min_{\boldsymbol{\beta}^{(g)}} \frac{1}{2n} \|\mathbf{r}^{(-g)} - \mathbf{X}^{(g)}\boldsymbol{\beta}^{(g)}\|_2^2 + \lambda\sqrt{p_g}\|\boldsymbol{\beta}^{(g)}\|_2$$
where r⁽⁻ᵍ⁾ = y - Σₕ≠ᵍ X⁽ʰ⁾β⁽ʰ⁾ is the partial residual excluding group g.
When the within-group Gram matrix is isotropic, i.e. (X⁽ᵍ⁾)ᵀX⁽ᵍ⁾/n = c_g I for some scalar c_g > 0 (for instance c_g = 1 after orthonormalizing the columns within each group, the standard Group Lasso preprocessing), this subproblem has a closed-form solution via the group soft-thresholding operator:
$$\boldsymbol{\beta}^{(g)} = \frac{1}{c_g} \left(1 - \frac{\lambda\sqrt{p_g}}{\|\mathbf{z}^{(g)}\|_2}\right)_+ \mathbf{z}^{(g)}$$
where z⁽ᵍ⁾ = (X⁽ᵍ⁾)ᵀr⁽⁻ᵍ⁾/n and (·)₊ = max(·, 0). For a general design the block subproblem has no exact closed form; in practice one either orthonormalizes within groups beforehand or approximates the group's Gram matrix by a scalar c_g (for example its average eigenvalue, ‖X⁽ᵍ⁾‖_F² / (n pᵍ)), as in the implementation below.
```python
import numpy as np

def group_soft_threshold(z, threshold):
    """Apply the group soft-thresholding operator: shrink z toward zero, or zero it entirely."""
    norm_z = np.linalg.norm(z)
    if norm_z <= threshold:
        return np.zeros_like(z)
    return (1 - threshold / norm_z) * z

def group_lasso_bcd(X, y, groups, lambda_param, max_iter=1000, tol=1e-6):
    """
    Block Coordinate Descent for Group Lasso.

    Parameters
    ----------
    X : ndarray of shape (n, p) - Design matrix
    y : ndarray of shape (n,) - Response vector
    groups : list of lists - Each sublist contains column indices for a group
    lambda_param : float - Regularization parameter
    max_iter : int - Maximum iterations
    tol : float - Convergence tolerance

    Returns
    -------
    beta : ndarray of shape (p,) - Coefficient vector
    """
    n, p = X.shape
    beta = np.zeros(p)

    # Scalar approximation c_g to each group's Gram matrix (X_g^T X_g)/n: its average
    # eigenvalue ||X_g||_F^2 / (n * p_g). The block update below is exact when
    # (X_g^T X_g)/n = c_g * I (e.g., columns orthonormalized within each group).
    group_sizes = [len(g) for g in groups]
    group_scales = [np.linalg.norm(X[:, g], 'fro')**2 / (n * len(g)) for g in groups]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for g_idx, g in enumerate(groups):
            # Partial residual: remove every group's contribution except group g's
            r_minus_g = y - X @ beta + X[:, g] @ beta[g]

            # Aggregate correlation with the partial residual: z_g = X_g^T r_(-g) / n
            z_g = X[:, g].T @ r_minus_g / n

            # Group soft-thresholding with the sqrt(p_g)-weighted threshold, rescaled by 1/c_g
            threshold = lambda_param * np.sqrt(group_sizes[g_idx])
            beta[g] = group_soft_threshold(z_g, threshold) / group_scales[g_idx]

        # Convergence check on the change in coefficients
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"Converged at iteration {iteration}")
            break

    return beta
```

An alternative approach uses proximal gradient descent (ISTA/FISTA), treating Group Lasso as a composite optimization problem:
$$\min_\beta f(\beta) + g(\beta)$$
where f(β) = (1/2n)||y - Xβ||₂² is smooth and g(β) = λΣᵍ√pᵍ||β⁽ᵍ⁾||₂ is non-smooth but has a tractable proximal operator.
The proximal operator for the Group Lasso penalty is:
$$\text{prox}_{\eta g}(\mathbf{u}) = \left( S_{\eta\lambda\sqrt{p_g}}\left(\mathbf{u}^{(g)}\right) \right)_{g=1}^{G}$$
where S is the group soft-thresholding operator applied independently to each group.
FISTA acceleration can achieve O(1/k²) convergence rate compared to O(1/k) for standard proximal gradient descent.
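A minimal ISTA-style sketch under these assumptions (constant step size 1/L with L the largest eigenvalue of XᵀX/n, reusing the group_soft_threshold helper defined above; FISTA would add a momentum step on top):

```python
import numpy as np

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent (ISTA) for the Group Lasso objective."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                              # constant step size
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n       # gradient of (1/2n)||y - X beta||_2^2
        u = beta - eta * grad                  # gradient step on the smooth part
        for g in groups:                       # proximal step: group soft-threshold each block
            beta[g] = group_soft_threshold(u[g], eta * lam * np.sqrt(len(g)))
    return beta
```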
Perhaps the most common application of Group Lasso is handling categorical features in regression. When a categorical variable with k levels is one-hot encoded, it creates k-1 dummy variables that should be treated as a group.
Standard Lasso might select 3 out of 5 dummy variables for 'day of week', making interpretation difficult—what does it mean to include Tuesday and Thursday but not Wednesday? Group Lasso selects the entire categorical variable or none of it, preserving interpretability.
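A sketch of how this looks in code, under illustrative assumptions (synthetic data, a 5-level categorical coded 0–4 with a reference-level encoding, and the group_lasso_bcd function defined earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
traffic = rng.normal(size=n)                          # one numeric feature
day = rng.integers(0, 5, size=n)                      # a 5-level categorical feature (coded 0..4)

# One-hot encode the categorical feature, dropping level 0 as the reference (k - 1 = 4 dummies)
dummies = (day[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([traffic, dummies])               # column 0: numeric, columns 1-4: dummies
y = 2.0 * traffic + rng.normal(size=n)                # placeholder response

# Group structure: the numeric column is its own group; the four dummies form one group
groups = [[0], [1, 2, 3, 4]]
beta = group_lasso_bcd(X, y, groups, lambda_param=0.1)
# The four day-of-week coefficients are selected or dropped together.
```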
In multi-task learning, we have T related prediction tasks sharing the same features. We want to identify features that are relevant across all tasks.
Arranging coefficients as a matrix B ∈ ℝᵖˣᵀ where each column is one task's coefficients, we apply Group Lasso with row groups:
$$\min_{\mathbf{B}} \sum_{t=1}^{T} \|\mathbf{y}_t - \mathbf{X}\boldsymbol{\beta}_t\|_2^2 + \lambda \sum_{j=1}^{p} \|\mathbf{B}_{j,:}\|_2$$
This encourages entire rows of B to be zero, meaning feature j is either selected for all tasks or none.
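A minimal proximal gradient sketch for this multi-task objective (hypothetical function name, shared design X, row-wise group soft-thresholding; no √pᵍ weight is needed because every row group has the same size T):

```python
import numpy as np

def multitask_group_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for min_B ||Y - X B||_F^2 + lam * sum_j ||B[j, :]||_2 (shared X)."""
    n, p = X.shape
    T = Y.shape[1]
    B = np.zeros((p, T))
    L = 2.0 * np.linalg.norm(X, 2) ** 2                # Lipschitz constant of the squared-error gradient
    eta = 1.0 / L
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ B)                # gradient of ||Y - X B||_F^2
        U = B - eta * grad                             # gradient step
        norms = np.linalg.norm(U, axis=1, keepdims=True)                    # row norms ||U[j, :]||_2
        shrink = np.maximum(1.0 - eta * lam / np.maximum(norms, 1e-12), 0.0)
        B = shrink * U                                 # row-wise group soft-thresholding
    return B
```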
In gene expression studies, genes participate in functional pathways. Grouping genes by pathway and applying Group Lasso selects or discards entire pathways at once, producing models whose selected features map directly onto biological mechanisms and are therefore easier to interpret and validate.
| Domain | Natural Groups | Benefit of Group Structure |
|---|---|---|
| Survey data | Multi-item scales for constructs | Select constructs, not individual items |
| Time series | Lags of the same variable | Include all lags or none for each variable |
| Image analysis | Spatial regions or patches | Preserve spatial coherence in selection |
| NLP | N-grams from same root word | Select word families together |
| Finance | Sector-grouped assets | Sector-level portfolio decisions |
You now understand Group Lasso—the fundamental extension of Lasso for structured sparsity. Next, we'll explore Fused Lasso, which enforces smoothness between coefficients of adjacent features, ideal for ordered data like time series and spatial signals.