Standard Lasso regression enforces sparsity at the individual coefficient level—each feature is independently shrunk toward zero, and some coefficients become exactly zero. While powerful, this element-wise approach ignores an important reality: features often belong to natural groups.
Consider these scenarios:

- A categorical variable with k levels is one-hot encoded into k − 1 dummy variables that only make sense together.
- In gene expression studies, genes act within functional pathways rather than in isolation.
- In time series regression, several lags of the same variable form a natural block that should enter or leave the model as a unit.
Group Lasso extends Lasso to handle these structured sparsity patterns, enabling group-level variable selection while maintaining the computational tractability that makes Lasso practical.
By the end of this page, you will understand the mathematical formulation of Group Lasso, its geometric interpretation, optimization algorithms, and practical applications. You'll be equipped to recognize when group structure exists in your data and how to exploit it for better model performance.
Let us partition the coefficient vector β ∈ ℝᵖ into G non-overlapping groups:
$$\boldsymbol{\beta} = (\boldsymbol{\beta}^{(1)}, \boldsymbol{\beta}^{(2)}, \ldots, \boldsymbol{\beta}^{(G)})$$
where β⁽ᵍ⁾ ∈ ℝᵖᵍ contains the pᵍ coefficients for group g, and Σᵍ pᵍ = p.
The Group Lasso objective function is:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g} \|\boldsymbol{\beta}^{(g)}\|_2$$
Let's dissect each component:

- The first term, (1/2n)||y − Xβ||₂², is the usual squared-error loss, averaged over the n observations.
- λ ≥ 0 controls the overall strength of regularization, exactly as in Lasso and Ridge.
- ||β⁽ᵍ⁾||₂ is the Euclidean norm of group g's coefficients; it is zero only when the entire group is zero.
- √pᵍ is a group-size weight that keeps groups of different sizes on a comparable penalty scale (more on this below).
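To make these pieces concrete, here is a minimal NumPy sketch (the function name and toy data are illustrative, not part of any library) that evaluates this objective for a given coefficient vector and group partition:

```python
import numpy as np

def group_lasso_objective(X, y, beta, groups, lam):
    """Evaluate (1/2n)||y - X beta||_2^2 + lam * sum_g sqrt(p_g) * ||beta_g||_2."""
    n = X.shape[0]
    loss = 0.5 / n * np.sum((y - X @ beta) ** 2)                # smooth squared-error term
    penalty = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g])     # sqrt(p_g)-weighted group norms
                  for g in groups)
    return loss + lam * penalty

# Example: p = 5 features split into two groups of sizes 2 and 3
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])   # the second group is entirely zero
y = X @ beta + 0.1 * rng.normal(size=10)
groups = [[0, 1], [2, 3, 4]]
print(group_lasso_objective(X, y, beta, groups, lam=0.1))
```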
The Group Lasso penalty is often written as an L2,1 mixed norm:
$$\Omega(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_{2,1} = \sum_{g=1}^{G} \|\boldsymbol{\beta}^{(g)}\|_2$$
This notation emphasizes the two-level structure:

- L1 across groups: the group norms ||β⁽ᵍ⁾||₂ are summed, so entire groups can be shrunk to exactly zero (group sparsity).
- L2 within groups: coefficients inside a group are penalized jointly through their Euclidean norm, so there is no sparsity within a selected group; if a group is selected, its coefficients are generically all nonzero.
Group Lasso enforces an all-or-nothing selection at the group level: either ||β⁽ᵍ⁾||₂ = 0 (the entire group is excluded) or ||β⁽ᵍ⁾||₂ > 0 (the group is included and its coefficients are generically all nonzero). This contrasts with standard Lasso, which can select arbitrary subsets of features.
The group size weighting √pᵍ serves a crucial normalization purpose. Without it, every group pays a penalty only on its L2 norm, which grows like √pᵍ rather than pᵍ, so the penalty per coefficient shrinks as the group gets larger, biasing selection toward large groups.

To see this, consider two groups whose coefficients all have magnitude 1: Group 1 with a single coefficient (||β⁽¹⁾||₂ = 1) and Group 2 with four coefficients (||β⁽²⁾||₂ = √4 = 2).

Without weighting, Group 2 incurs only twice the penalty of Group 1 while contributing four times as many coefficients, so each of its coefficients is penalized half as much. The √pᵍ weighting equalizes this: the weighted penalties become √1 · 1 = 1 and √4 · 2 = 4, exactly one unit of penalty per coefficient in either group.

This ensures each group is penalized in proportion to the number of parameters it contributes, so selection reflects signal strength rather than group size. (Equivalently, under pure noise the aggregate correlation ||(X⁽ᵍ⁾)ᵀr||₂ grows like √pᵍ, and the weight keeps the selection threshold comparable across group sizes.)
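A quick numeric check of this argument (the group sizes 1 and 4 are arbitrary illustrative choices):

```python
import numpy as np

# Two groups whose coefficients all have magnitude 1
beta_g1 = np.ones(1)   # p_1 = 1, ||beta_1||_2 = 1
beta_g2 = np.ones(4)   # p_2 = 4, ||beta_2||_2 = 2

for name, b in [("group 1", beta_g1), ("group 2", beta_g2)]:
    p_g = len(b)
    unweighted = np.linalg.norm(b)                   # raw L2 norm of the group
    weighted = np.sqrt(p_g) * np.linalg.norm(b)      # sqrt(p_g)-weighted penalty
    print(name, "unweighted:", unweighted,
          "weighted:", weighted,
          "weighted per coefficient:", weighted / p_g)

# Unweighted, group 2 pays only 2x the penalty for 4x the coefficients;
# weighted, both groups pay exactly one unit of penalty per coefficient.
```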
Just as Lasso's L1 penalty corresponds to a diamond-shaped constraint region and Ridge's L2 penalty corresponds to a spherical constraint region, Group Lasso's penalty defines a distinctive geometric shape.
Consider the constraint form of Group Lasso:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \sum_{g=1}^{G} \|\boldsymbol{\beta}^{(g)}\|_2 \leq t$$
For a simple case with two groups of two variables each (β = (β₁, β₂, β₃, β₄)), the constraint becomes:
$$\sqrt{\beta_1^2 + \beta_2^2} + \sqrt{\beta_3^2 + \beta_4^2} \leq t$$
The Group Lasso constraint region has a distinctive shape: it is 'round' within groups (L2 creates circular/spherical cross-sections) but has 'corners' between groups (L1 structure). These corners lie along the subspaces where entire groups are zero, enabling group sparsity when the loss contour touches these corners.
Imagine the 4D constraint region projected into understandable components:

- Slice within a group, say the (β₁, β₂) plane with β₃ = β₄ = 0: the cross-section is a disk, since only the L2 norm of group 1 is constrained.
- Slice across groups, say the (β₁, β₃) plane with β₂ = β₄ = 0: the constraint reduces to |β₁| + |β₃| ≤ t, the familiar Lasso diamond with corners on the axes.
This creates a shape that looks like a hypercylinder with corners—smooth and round where you slice through a single group, but with sharp edges where groups meet.
The geometric intuition for sparsity follows the same logic as standard Lasso: the solution lies where the elliptical contours of the squared-error loss first touch the constraint region, and the non-smooth points of that region, which here are exactly the places where an entire group's norm is zero, are disproportionately likely contact points.

The probability of landing exactly on such a corner (inducing group sparsity) increases as regularization strengthens: a larger λ (equivalently, a smaller budget t) shrinks the constraint region and makes contact at a point where one or more whole groups are zero more likely.
| Method | Penalty | Constraint Shape | Sparsity Type |
|---|---|---|---|
| Ridge (L2) | ‖β‖₂² | Hypersphere (smooth everywhere) | No sparsity |
| Lasso (L1) | ‖β‖₁ | Cross-polytope / diamond (corners on the axes) | Element-wise sparsity |
| Group Lasso | Σᵍ ‖β⁽ᵍ⁾‖₂ | Hybrid: round within groups, corners between | Group-wise sparsity |
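As a small illustration (plain NumPy, illustrative values), the snippet below evaluates all three penalties on a coefficient vector whose second group is exactly zero; only the Group Lasso penalty is built from group norms and therefore "sees" that structure:

```python
import numpy as np

beta = np.array([1.5, -0.5, 0.0, 0.0])   # group 1 = (beta1, beta2), group 2 = (beta3, beta4)
groups = [[0, 1], [2, 3]]

ridge = np.sum(beta ** 2)                              # ||beta||_2^2: smooth, no sparsity
lasso = np.sum(np.abs(beta))                           # ||beta||_1: element-wise corners
group = sum(np.linalg.norm(beta[g]) for g in groups)   # sum_g ||beta^(g)||_2: group-level corners
print("ridge:", ridge, "lasso:", lasso, "group lasso:", group)
```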
Like the Lasso, Group Lasso is non-differentiable when any group's coefficient vector equals zero. The L2 norm ||β⁽ᵍ⁾||₂ has no gradient at β⁽ᵍ⁾ = 0 because the function has a 'kink' at the origin.
To characterize optimality, we use subdifferential calculus. The subdifferential of the Group Lasso penalty with respect to group g is:
$$\partial \|\boldsymbol{\beta}^{(g)}\|_2 = \begin{cases} \left\{ \dfrac{\boldsymbol{\beta}^{(g)}}{\|\boldsymbol{\beta}^{(g)}\|_2} \right\} & \text{if } \boldsymbol{\beta}^{(g)} \neq \mathbf{0} \\ \left\{ \mathbf{v} : \|\mathbf{v}\|_2 \leq 1 \right\} & \text{if } \boldsymbol{\beta}^{(g)} = \mathbf{0} \end{cases}$$
Define the residual vector r = y - Xβ and let X⁽ᵍ⁾ denote the columns of X corresponding to group g. The KKT optimality conditions state that β* is optimal if and only if for each group g:
Case 1: Active group (β⁽ᵍ⁾ ≠ 0) $$\frac{1}{n}(\mathbf{X}^{(g)})^T \mathbf{r} = \lambda \sqrt{p_g} \, \frac{\boldsymbol{\beta}^{(g)}}{\|\boldsymbol{\beta}^{(g)}\|_2}$$
Case 2: Inactive group (β⁽ᵍ⁾ = 0) $$\left\| \frac{1}{n}(\mathbf{X}^{(g)})^T \mathbf{r} \right\|_2 \leq \lambda \sqrt{p_g}$$
These conditions have intuitive interpretations:
A group enters the model (becomes nonzero) only when its aggregate correlation with the residual exceeds λ√pᵍ. This is fundamentally different from Lasso, where each coefficient has its own threshold. In Group Lasso, weak individual correlations can combine to exceed the group threshold.
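A minimal sketch of checking these conditions numerically, assuming a candidate solution beta, the same 1/(2n) loss scaling used above, and hypothetical function and tolerance names:

```python
import numpy as np

def check_group_kkt(X, y, beta, groups, lam, tol=1e-6):
    """Report, for each group, whether the Group Lasso KKT condition holds at beta."""
    n = X.shape[0]
    r = y - X @ beta                                   # residual at the candidate solution
    for g in groups:
        corr = X[:, g].T @ r / n                       # (1/n) X_g^T r: aggregate correlation of group g
        threshold = lam * np.sqrt(len(g))
        if np.linalg.norm(beta[g]) > tol:              # active group: equality with the unit direction
            target = threshold * beta[g] / np.linalg.norm(beta[g])
            ok = np.allclose(corr, target, atol=1e-4)
        else:                                          # inactive group: correlation norm below the threshold
            ok = np.linalg.norm(corr) <= threshold + 1e-4
        print(g, "KKT satisfied:", ok)
```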
The most popular algorithm for Group Lasso is block coordinate descent, which cycles through groups and optimizes each group's coefficients while holding others fixed.
For group g, the subproblem is:
$$\min_{\boldsymbol{\beta}^{(g)}} \frac{1}{2n} \|\mathbf{r}^{(-g)} - \mathbf{X}^{(g)}\boldsymbol{\beta}^{(g)}\|_2^2 + \lambda\sqrt{p_g}\|\boldsymbol{\beta}^{(g)}\|_2$$
where r⁽⁻ᵍ⁾ = y - Σₕ≠ᵍ X⁽ʰ⁾β⁽ʰ⁾ is the partial residual excluding group g.
When the within-group Gram matrix is isotropic, i.e. (X⁽ᵍ⁾)ᵀX⁽ᵍ⁾/n = c_g I for some scalar c_g > 0 (for instance c_g = 1 after orthonormalizing the columns within each group, the standard Group Lasso preprocessing), this subproblem has a closed-form solution via the group soft-thresholding operator:
$$\boldsymbol{\beta}^{(g)} = \frac{1}{c_g} \left(1 - \frac{\lambda\sqrt{p_g}}{\|\mathbf{z}^{(g)}\|_2}\right)_+ \mathbf{z}^{(g)}$$
where z⁽ᵍ⁾ = (X⁽ᵍ⁾)ᵀr⁽⁻ᵍ⁾/n and (·)₊ = max(·, 0). For a general design the block subproblem has no exact closed form; in practice one either orthonormalizes within groups beforehand or approximates the group's Gram matrix by a scalar c_g (for example its average eigenvalue, ‖X⁽ᵍ⁾‖_F² / (n pᵍ)), as in the implementation below.
```python
import numpy as np

def group_soft_threshold(z, threshold):
    """Apply the group soft-thresholding operator: shrink z toward zero, or zero it entirely."""
    norm_z = np.linalg.norm(z)
    if norm_z <= threshold:
        return np.zeros_like(z)
    return (1 - threshold / norm_z) * z

def group_lasso_bcd(X, y, groups, lambda_param, max_iter=1000, tol=1e-6):
    """
    Block Coordinate Descent for Group Lasso.

    Parameters
    ----------
    X : ndarray of shape (n, p) - Design matrix
    y : ndarray of shape (n,) - Response vector
    groups : list of lists - Each sublist contains column indices for a group
    lambda_param : float - Regularization parameter
    max_iter : int - Maximum iterations
    tol : float - Convergence tolerance

    Returns
    -------
    beta : ndarray of shape (p,) - Coefficient vector
    """
    n, p = X.shape
    beta = np.zeros(p)

    # Scalar approximation c_g to each group's Gram matrix (X_g^T X_g)/n: its average
    # eigenvalue ||X_g||_F^2 / (n * p_g). The block update below is exact when
    # (X_g^T X_g)/n = c_g * I (e.g., columns orthonormalized within each group).
    group_sizes = [len(g) for g in groups]
    group_scales = [np.linalg.norm(X[:, g], 'fro')**2 / (n * len(g)) for g in groups]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for g_idx, g in enumerate(groups):
            # Partial residual: remove every group's contribution except group g's
            r_minus_g = y - X @ beta + X[:, g] @ beta[g]

            # Aggregate correlation with the partial residual: z_g = X_g^T r_(-g) / n
            z_g = X[:, g].T @ r_minus_g / n

            # Group soft-thresholding with the sqrt(p_g)-weighted threshold, rescaled by 1/c_g
            threshold = lambda_param * np.sqrt(group_sizes[g_idx])
            beta[g] = group_soft_threshold(z_g, threshold) / group_scales[g_idx]

        # Convergence check on the change in coefficients
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"Converged at iteration {iteration}")
            break

    return beta
```

An alternative approach uses proximal gradient descent (ISTA/FISTA), treating Group Lasso as a composite optimization problem:
$$\min_\beta f(\beta) + g(\beta)$$
where f(β) = (1/2n)||y - Xβ||₂² is smooth and g(β) = λΣᵍ√pᵍ||β⁽ᵍ⁾||₂ is non-smooth but has a tractable proximal operator.
The proximal operator for the Group Lasso penalty is:
$$\text{prox}_{\eta g}(\mathbf{u}) = \left( S_{\eta\lambda\sqrt{p_g}}\left(\mathbf{u}^{(g)}\right) \right)_{g=1}^{G}$$
where S is the group soft-thresholding operator applied independently to each group.
FISTA acceleration can achieve O(1/k²) convergence rate compared to O(1/k) for standard proximal gradient descent.
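A minimal ISTA-style sketch under these assumptions (constant step size 1/L with L the largest eigenvalue of XᵀX/n, reusing the group_soft_threshold helper defined above; FISTA would add a momentum step on top):

```python
import numpy as np

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent (ISTA) for the Group Lasso objective."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part's gradient
    eta = 1.0 / L                              # constant step size
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n       # gradient of (1/2n)||y - X beta||_2^2
        u = beta - eta * grad                  # gradient step on the smooth part
        for g in groups:                       # proximal step: group soft-threshold each block
            beta[g] = group_soft_threshold(u[g], eta * lam * np.sqrt(len(g)))
    return beta
```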
Perhaps the most common application of Group Lasso is handling categorical features in regression. When a categorical variable with k levels is one-hot encoded, it creates k-1 dummy variables that should be treated as a group.
Standard Lasso might select 3 out of 5 dummy variables for 'day of week', making interpretation difficult—what does it mean to include Tuesday and Thursday but not Wednesday? Group Lasso selects the entire categorical variable or none of it, preserving interpretability.
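A sketch of how this looks in code, under illustrative assumptions (synthetic data, a 5-level categorical coded 0–4 with a reference-level encoding, and the group_lasso_bcd function defined earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
traffic = rng.normal(size=n)                          # one numeric feature
day = rng.integers(0, 5, size=n)                      # a 5-level categorical feature (coded 0..4)

# One-hot encode the categorical feature, dropping level 0 as the reference (k - 1 = 4 dummies)
dummies = (day[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([traffic, dummies])               # column 0: numeric, columns 1-4: dummies
y = 2.0 * traffic + rng.normal(size=n)                # placeholder response

# Group structure: the numeric column is its own group; the four dummies form one group
groups = [[0], [1, 2, 3, 4]]
beta = group_lasso_bcd(X, y, groups, lambda_param=0.1)
# The four day-of-week coefficients are selected or dropped together.
```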
In multi-task learning, we have T related prediction tasks sharing the same features. We want to identify features that are relevant across all tasks.
Arranging coefficients as a matrix B ∈ ℝᵖˣᵀ where each column is one task's coefficients, we apply Group Lasso with row groups:
$$\min_{\mathbf{B}} \sum_{t=1}^{T} \|\mathbf{y}_t - \mathbf{X}\boldsymbol{\beta}_t\|_2^2 + \lambda \sum_{j=1}^{p} \|\mathbf{B}_{j,:}\|_2$$
This encourages entire rows of B to be zero, meaning feature j is either selected for all tasks or none.
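A minimal proximal gradient sketch for this multi-task objective (hypothetical function name, shared design X, row-wise group soft-thresholding; no √pᵍ weight is needed because every row group has the same size T):

```python
import numpy as np

def multitask_group_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for min_B ||Y - X B||_F^2 + lam * sum_j ||B[j, :]||_2 (shared X)."""
    n, p = X.shape
    T = Y.shape[1]
    B = np.zeros((p, T))
    L = 2.0 * np.linalg.norm(X, 2) ** 2                # Lipschitz constant of the squared-error gradient
    eta = 1.0 / L
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ B)                # gradient of ||Y - X B||_F^2
        U = B - eta * grad                             # gradient step
        norms = np.linalg.norm(U, axis=1, keepdims=True)                    # row norms ||U[j, :]||_2
        shrink = np.maximum(1.0 - eta * lam / np.maximum(norms, 1e-12), 0.0)
        B = shrink * U                                 # row-wise group soft-thresholding
    return B
```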
In gene expression studies, genes participate in functional pathways. Grouping genes by pathway and applying Group Lasso selects or discards entire pathways at once, producing models whose selected features map directly onto biological mechanisms and are therefore easier to interpret and validate.
| Domain | Natural Groups | Benefit of Group Structure |
|---|---|---|
| Survey data | Multi-item scales for constructs | Select constructs, not individual items |
| Time series | Lags of the same variable | Include all lags or none for each variable |
| Image analysis | Spatial regions or patches | Preserve spatial coherence in selection |
| NLP | N-grams from same root word | Select word families together |
| Finance | Sector-grouped assets | Sector-level portfolio decisions |
You now understand Group Lasso—the fundamental extension of Lasso for structured sparsity. Next, we'll explore Fused Lasso, which enforces smoothness between coefficients of adjacent features, ideal for ordered data like time series and spatial signals.