Standard regularization methods—Ridge, Lasso, Elastic Net—treat all coefficients uniformly. They assume no prior knowledge about relationships between features. But in many real-world problems, we possess valuable structural information: features connected in a known network, features organized into a hierarchy, or features belonging to groups that may overlap.
Structured regularization incorporates this domain knowledge directly into the penalty term, encouraging solutions that respect the known structure. This leads to more interpretable models and often improves both prediction and support recovery.
By the end of this page, you will understand how to encode various structural constraints through penalty design, including graph-based regularization, hierarchical penalties, overlapping group structures, and sparse-group combinations. You'll gain the ability to design custom penalties for novel domain structures.
When features are connected by a graph G = (V, E), we can encourage connected features to have similar coefficients using the graph Laplacian penalty:
$$\Omega_G(\boldsymbol{\beta}) = \sum_{(i,j) \in E} w_{ij}(\beta_i - \beta_j)^2 = \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta}$$
where L is the graph Laplacian matrix:

$$\mathbf{L} = \mathbf{D} - \mathbf{W}, \qquad D_{ii} = \sum_{j} w_{ij}$$

with W the weighted adjacency matrix of edge weights wᵢⱼ and D the diagonal degree matrix.
This penalty is a smoothness prior over the graph: coefficients of connected features should be similar.
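To make this concrete, here is a minimal sketch, assuming the graph is given as a list of weighted undirected edges (the helper names `build_laplacian` and `graph_penalty` are illustrative, not from any particular library), that builds L = D − W and evaluates the penalty βᵀLβ:

```python
import numpy as np

def build_laplacian(n_features, edges):
    """Build the graph Laplacian L = D - W from a weighted edge list.

    edges : iterable of (i, j, w_ij) tuples for an undirected graph.
    """
    W = np.zeros((n_features, n_features))
    for i, j, w in edges:
        W[i, j] = w
        W[j, i] = w                       # undirected graph: symmetric weights
    D = np.diag(W.sum(axis=1))            # diagonal degree matrix
    return D - W

def graph_penalty(beta, L):
    """Evaluate the Laplacian smoothness penalty beta^T L beta."""
    return float(beta @ L @ beta)

# Tiny example: a 3-node chain 0 -- 1 -- 2 with unit weights
L = build_laplacian(3, [(0, 1, 1.0), (1, 2, 1.0)])
beta_smooth = np.array([1.0, 1.0, 1.0])   # constant over the graph
beta_rough  = np.array([1.0, -1.0, 1.0])  # alternates across edges
print(graph_penalty(beta_smooth, L))      # 0.0: no penalty for smooth coefficients
print(graph_penalty(beta_rough, L))       # 8.0: large penalty for rough coefficients
```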
The complete objective for graph-regularized regression:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2 \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta}$$
This combines:
- the squared-error data-fitting loss,
- an ℓ1 penalty (weight λ₁) that drives many coefficients exactly to zero, and
- the Laplacian penalty (weight λ₂) that smooths coefficients across the edges of the graph.

A minimal solver sketch for this objective follows.
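Because the Laplacian term is a smooth quadratic, the whole objective can be minimized with proximal gradient descent, with only the ℓ1 term requiring a soft-thresholding step. The code below is a minimal sketch under that observation; the function name `fit_graph_lasso` and the fixed step-size choice are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def fit_graph_lasso(X, y, L, lam1, lam2, n_iter=500, tol=1e-8):
    """Proximal gradient (ISTA) for
        (1/2n)||y - X beta||^2 + lam1*||beta||_1 + lam2 * beta^T L beta.

    The smooth part is the squared loss plus the Laplacian quadratic;
    the L1 term is handled by soft thresholding.
    """
    n, p = X.shape
    beta = np.zeros(p)
    # Step size = 1 / Lipschitz constant of the smooth gradient
    step = 1.0 / np.linalg.norm(X.T @ X / n + 2 * lam2 * L, 2)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n + 2 * lam2 * (L @ beta)
        z = beta - step * grad
        beta_new = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```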
In genomics, genes interact in regulatory networks. When predicting disease status, genes in the same pathway or regulatory module are expected to influence the outcome in similar ways, so the Laplacian penalty pulls their coefficients toward each other while the ℓ1 term still selects a sparse set of relevant genes.
The regularization is only as good as the graph structure. Options include: (1) Prior knowledge graphs (e.g., known gene interactions), (2) Correlation-based graphs (connect features with |cor| > threshold), (3) k-nearest neighbor graphs, (4) Learned graphs via graphical models.
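As an illustration of option (2), here is a small sketch that connects strongly correlated features and returns the corresponding Laplacian; the 0.7 threshold and the helper name `correlation_laplacian` are arbitrary choices for the example, not a standard:

```python
import numpy as np

def correlation_laplacian(X, threshold=0.7):
    """Connect features whose absolute correlation exceeds a threshold,
    weight edges by |correlation|, and return the resulting Laplacian."""
    corr = np.corrcoef(X, rowvar=False)    # p x p feature correlation matrix
    W = np.abs(corr)
    W[W < threshold] = 0.0                 # drop weak connections
    np.fill_diagonal(W, 0.0)               # no self-loops
    D = np.diag(W.sum(axis=1))             # degree matrix
    return D - W
```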
In many domains, features have hierarchical relationships in which parent features should be selected before their children (the table below gives examples).
For a tree-structured hierarchy, the hierarchical Lasso enforces:
If coefficient βⱼ is nonzero, then all ancestors of j must be nonzero.
This is achieved by defining groups along root-to-leaf paths:
$$\Omega_H(\boldsymbol{\beta}) = \sum_{g \in \text{paths}} \|\boldsymbol{\beta}_g\|_2$$
where each path group g contains a feature and all its ancestors.
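To make the path-group construction concrete, here is a brief sketch assuming the hierarchy is encoded as a parent-pointer list with −1 marking the root (the function name `path_groups` is illustrative):

```python
def path_groups(parent):
    """Build hierarchical path groups from a parent-pointer tree.

    parent : list where parent[j] is the index of j's parent, or -1 for a root.
    Returns one group per feature: the feature itself plus all its ancestors.
    """
    groups = []
    for j in range(len(parent)):
        group, node = [], j
        while node != -1:          # walk up from the feature to the root
            group.append(node)
            node = parent[node]
        groups.append(sorted(group))
    return groups

# Example: 0 is the root, 1 and 2 are its children, 3 is a child of 1
print(path_groups([-1, 0, 0, 1]))
# [[0], [0, 1], [0, 2], [0, 1, 3]]
```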
| Domain | Hierarchy | Constraint Effect |
|---|---|---|
| Text classification | Word → Phrase → Sentence | Larger units selected before constituents |
| Medical diagnosis | Symptom category → Specific symptom | General before specific |
| Marketing | Customer segment → Sub-segment | Broad targeting before micro-targeting |
| Polynomial features | Linear → Quadratic → Cubic | Main effects before interactions |
Two flavors of hierarchical constraints exist:
- **Strong hierarchy:** the parent must be nonzero for a child to be nonzero.
- **Weak hierarchy:** the parent can be zero while a child is nonzero, but the penalty discourages this.
The strong hierarchy constraint is non-convex in general, but can be approximated through carefully designed group penalties.
Standard Group Lasso assumes non-overlapping groups, but real structures often have overlapping memberships: a gene can belong to several biological pathways, for example, and a word feature can appear in several topic groups.
For overlapping groups G₁, G₂, ..., G_K, introduce latent variables v⁽ᵍ⁾ for each group:
$$\min_{\boldsymbol{\beta}, \{\mathbf{v}^{(g)}\}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{g=1}^{K}\|\mathbf{v}^{(g)}\|_2$$

$$\text{subject to } \beta_j = \sum_{g:\, j \in G_g} v_j^{(g)}$$
Each coefficient βⱼ is the sum of contributions from all groups it belongs to. The penalty encourages entire groups of latent variables to be zero.
Overlapping groups require more sophisticated optimization. The proximal operator no longer has a closed form, typically requiring iterative inner loops. For large-scale problems, consider approximations or specialized algorithms like ADMM with smart splitting.
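One practical workaround, following the latent-variable view above, is the column-duplication trick: give each group its own copy of its columns so the expanded groups no longer overlap, fit an ordinary Group Lasso on the expanded design, and sum the copies back into β. The sketch below only builds the expansion and the mapping back; the function names are illustrative assumptions, and any standard Group Lasso solver can be applied to the result:

```python
import numpy as np

def expand_overlapping_groups(X, groups):
    """Column-duplication trick for the latent overlapping Group Lasso.

    Each group receives its own copy of its columns, so the expanded groups
    are non-overlapping. If v solves the Group Lasso on the expanded design,
    the original beta_j is the sum of v over all copies of column j,
    matching the constraint beta_j = sum_{g: j in G_g} v_j^(g).
    """
    cols, new_groups, col_origin, start = [], [], [], 0
    for g in groups:
        cols.append(X[:, g])                            # copy this group's columns
        new_groups.append(list(range(start, start + len(g))))
        col_origin.extend(g)                            # remember original indices
        start += len(g)
    X_expanded = np.hstack(cols)
    return X_expanded, new_groups, np.array(col_origin)

def collapse_coefficients(v, col_origin, p):
    """Map expanded coefficients back: beta_j = sum of all copies of j."""
    beta = np.zeros(p)
    np.add.at(beta, col_origin, v)
    return beta
```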
Standard Group Lasso has an all-or-nothing property: if a group is selected, all its coefficients are nonzero. But we often want sparsity at both levels: only a few groups selected, and within each selected group only a few features active.
Sparse Group Lasso achieves both by combining Group Lasso with standard Lasso:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha\lambda\|\boldsymbol{\beta}\|_1 + (1-\alpha)\lambda\sum_{g=1}^{G}\sqrt{p_g}\|\boldsymbol{\beta}^{(g)}\|_2$$
where α ∈ [0, 1] balances element-wise and group-wise sparsity.
| α Value | Behavior | Use Case |
|---|---|---|
| α = 0 | Pure Group Lasso | All features in selected groups are relevant |
| α = 0.5 | Balanced | Some group structure, some within-group sparsity |
| α = 1 | Pure Lasso | No group structure, individual selection |
```python
import numpy as np


def sparse_group_soft_threshold(z, lambda1, lambda2):
    """Proximal operator for the sparse group lasso penalty.

    Solves: argmin_v (1/2)||v - z||^2 + lambda1*||v||_1 + lambda2*||v||_2

    Parameters
    ----------
    z : ndarray
        Input vector (one group's coefficients).
    lambda1 : float
        Element-wise L1 penalty weight.
    lambda2 : float
        Group L2 penalty weight.

    Returns
    -------
    v : ndarray
        Proximal solution.
    """
    # Step 1: element-wise soft thresholding (L1 part)
    v = np.sign(z) * np.maximum(np.abs(z) - lambda1, 0.0)

    # Step 2: group soft thresholding (L2 part) applied to the result
    norm_v = np.linalg.norm(v)
    if norm_v > lambda2:
        v = (1.0 - lambda2 / norm_v) * v
    else:
        v = np.zeros_like(z)
    return v


def fit_sparse_group_lasso(X, y, groups, lambda_param, alpha=0.5,
                           max_iter=1000, tol=1e-6):
    """Fit Sparse Group Lasso via block-wise proximal coordinate descent.

    Parameters
    ----------
    X : ndarray, shape (n, p)
        Design matrix.
    y : ndarray, shape (n,)
        Response vector.
    groups : list of lists
        Indices of the features in each (non-overlapping) group.
    lambda_param : float
        Overall regularization strength.
    alpha : float
        Balance between L1 and group L2 penalties (0 = pure group, 1 = pure lasso).
    max_iter : int
        Maximum number of passes over the groups.
    tol : float
        Convergence tolerance on the coefficient change.

    Returns
    -------
    beta : ndarray, shape (p,)
        Fitted coefficients.
    """
    n, p = X.shape
    beta = np.zeros(p)

    group_sizes = [len(g) for g in groups]
    # Per-group Lipschitz constants of the smooth loss: ||X_g||_2^2 / n
    group_lipschitz = [np.linalg.norm(X[:, g], 2) ** 2 / n for g in groups]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for g_idx, g in enumerate(groups):
            L_g = group_lipschitz[g_idx]
            if L_g == 0:
                continue  # all-zero columns: leave these coefficients at zero

            # Gradient of the smooth loss restricted to this group
            residual = y - X @ beta
            grad_g = -X[:, g].T @ residual / n

            # Proximal gradient step for this block, with step size 1 / L_g
            z_g = beta[g] - grad_g / L_g
            lambda1 = alpha * lambda_param / L_g
            lambda2 = (1 - alpha) * lambda_param * np.sqrt(group_sizes[g_idx]) / L_g
            beta[g] = sparse_group_soft_threshold(z_g, lambda1, lambda2)

        if np.linalg.norm(beta - beta_old) < tol:
            print(f"Converged at iteration {iteration}")
            break

    return beta
```

Structured regularization follows a general pattern:
$$\min_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}) + \Omega(\boldsymbol{\beta})$$
where Ω(β) encodes the structural prior. To design a custom penalty, start from the structure you want solutions to respect and assemble Ω from a few standard ingredients (a generic solver sketch follows the list):
- **Norm combinations:** mix ℓ1, ℓ2, and group norms over chosen index sets, as Elastic Net, Group Lasso, and Sparse Group Lasso do, to trade off element-wise and group-wise sparsity.
- **Linear transformations:** penalize a transformed coefficient vector, for example ‖Dβ‖₁ for a difference matrix D as in the Fused Lasso, or the quadratic βᵀLβ for a graph Laplacian L.
- **Weighted penalties:** assign feature- or group-specific weights, as in the Adaptive Lasso or the √p_g scaling used by Group Lasso, to reflect differing prior confidence in each feature.
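As a design sketch rather than a finished implementation, any penalty whose proximal operator you can compute drops into a generic proximal-gradient loop. The names `fit_structured` and `weighted_l1_prox` below are assumptions for illustration, with a feature-weighted ℓ1 penalty standing in for a custom Ω:

```python
import numpy as np

def fit_structured(X, y, prox_penalty, n_iter=500):
    """Generic proximal gradient loop: smooth squared loss plus a custom penalty.

    prox_penalty(z, step) must return argmin_v (1/2)||v - z||^2 + step*Omega(v).
    """
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X.T @ X / n, 2)   # 1 / Lipschitz constant of the loss
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta = prox_penalty(beta - step * grad, step)
    return beta

# Example custom penalty: weighted L1, Omega(beta) = lam * sum_j w_j * |beta_j|
def weighted_l1_prox(weights, lam):
    def prox(z, step):
        return np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return prox

# Usage sketch: down-weight the first feature, penalize the others normally
# beta = fit_structured(X, y, weighted_l1_prox(np.array([0.1, 1.0, 1.0]), lam=0.5))
```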
Use structured regularization when: (1) Genuine prior knowledge exists about feature relationships, (2) Interpretability of the structure is important, (3) Standard methods fail to exploit known patterns, (4) The structure is reliably known (incorrect structure can hurt performance).
Congratulations! You've completed Module 6: Advanced Regularization. You now possess a comprehensive understanding of regularization beyond basic Ridge and Lasso—from Group Lasso and Fused Lasso through Bridge Regression, non-convex penalties, and structured regularization. These tools enable you to tackle complex real-world problems where domain structure matters.