Standard regularization methods—Ridge, Lasso, Elastic Net—treat all coefficients uniformly. They assume no prior knowledge about relationships between features. But in many real-world problems, we possess valuable structural information: features connected in a known network, features organized into a hierarchy, or features belonging to groups that may overlap.
Structured regularization incorporates this domain knowledge directly into the penalty term, encouraging solutions that respect the known structure. This leads to more interpretable models and often improves both prediction and support recovery.
By the end of this page, you will understand how to encode various structural constraints through penalty design, including graph-based regularization, hierarchical penalties, overlapping group structures, and sparse-group combinations. You'll gain the ability to design custom penalties for novel domain structures.
When features are connected by a graph G = (V, E), we can encourage connected features to have similar coefficients using the graph Laplacian penalty:
$$\Omega_G(\boldsymbol{\beta}) = \sum_{(i,j) \in E} w_{ij}(\beta_i - \beta_j)^2 = \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta}$$
where L is the graph Laplacian matrix:

$$\mathbf{L} = \mathbf{D} - \mathbf{W}, \qquad D_{ii} = \sum_{j} w_{ij}$$

with W the weighted adjacency matrix of edge weights wᵢⱼ and D the diagonal degree matrix.
This penalty is a smoothness prior over the graph: coefficients of connected features should be similar.
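To make this concrete, here is a minimal sketch, assuming the graph is given as a list of weighted undirected edges (the helper names `build_laplacian` and `graph_penalty` are illustrative, not from any particular library), that builds L = D − W and evaluates the penalty βᵀLβ:

```python
import numpy as np

def build_laplacian(n_features, edges):
    """Build the graph Laplacian L = D - W from a weighted edge list.

    edges : iterable of (i, j, w_ij) tuples for an undirected graph.
    """
    W = np.zeros((n_features, n_features))
    for i, j, w in edges:
        W[i, j] = w
        W[j, i] = w                       # undirected graph: symmetric weights
    D = np.diag(W.sum(axis=1))            # diagonal degree matrix
    return D - W

def graph_penalty(beta, L):
    """Evaluate the Laplacian smoothness penalty beta^T L beta."""
    return float(beta @ L @ beta)

# Tiny example: a 3-node chain 0 -- 1 -- 2 with unit weights
L = build_laplacian(3, [(0, 1, 1.0), (1, 2, 1.0)])
beta_smooth = np.array([1.0, 1.0, 1.0])   # constant over the graph
beta_rough  = np.array([1.0, -1.0, 1.0])  # alternates across edges
print(graph_penalty(beta_smooth, L))      # 0.0: no penalty for smooth coefficients
print(graph_penalty(beta_rough, L))       # 8.0: large penalty for rough coefficients
```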
The complete objective for graph-regularized regression:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2 \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta}$$
This combines:
- the squared-error data-fitting loss,
- an ℓ1 penalty (weight λ₁) that drives many coefficients exactly to zero, and
- the Laplacian penalty (weight λ₂) that smooths coefficients across the edges of the graph.

A minimal solver sketch for this objective follows.
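Because the Laplacian term is a smooth quadratic, the whole objective can be minimized with proximal gradient descent, with only the ℓ1 term requiring a soft-thresholding step. The code below is a minimal sketch under that observation; the function name `fit_graph_lasso` and the fixed step-size choice are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def fit_graph_lasso(X, y, L, lam1, lam2, n_iter=500, tol=1e-8):
    """Proximal gradient (ISTA) for
        (1/2n)||y - X beta||^2 + lam1*||beta||_1 + lam2 * beta^T L beta.

    The smooth part is the squared loss plus the Laplacian quadratic;
    the L1 term is handled by soft thresholding.
    """
    n, p = X.shape
    beta = np.zeros(p)
    # Step size = 1 / Lipschitz constant of the smooth gradient
    step = 1.0 / np.linalg.norm(X.T @ X / n + 2 * lam2 * L, 2)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n + 2 * lam2 * (L @ beta)
        z = beta - step * grad
        beta_new = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```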
In genomics, genes interact in regulatory networks. When predicting disease status, genes in the same pathway or regulatory module are expected to influence the outcome in similar ways, so the Laplacian penalty pulls their coefficients toward each other while the ℓ1 term still selects a sparse set of relevant genes.
The regularization is only as good as the graph structure. Options include: (1) Prior knowledge graphs (e.g., known gene interactions), (2) Correlation-based graphs (connect features with |cor| > threshold), (3) k-nearest neighbor graphs, (4) Learned graphs via graphical models.
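As an illustration of option (2), here is a small sketch that connects strongly correlated features and returns the corresponding Laplacian; the 0.7 threshold and the helper name `correlation_laplacian` are arbitrary choices for the example, not a standard:

```python
import numpy as np

def correlation_laplacian(X, threshold=0.7):
    """Connect features whose absolute correlation exceeds a threshold,
    weight edges by |correlation|, and return the resulting Laplacian."""
    corr = np.corrcoef(X, rowvar=False)    # p x p feature correlation matrix
    W = np.abs(corr)
    W[W < threshold] = 0.0                 # drop weak connections
    np.fill_diagonal(W, 0.0)               # no self-loops
    D = np.diag(W.sum(axis=1))             # degree matrix
    return D - W
```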
In many domains, features have hierarchical relationships in which parent features should be selected before their children (the table below gives examples).
For a tree-structured hierarchy, the hierarchical Lasso enforces:
If coefficient βⱼ is nonzero, then all ancestors of j must be nonzero.
This is achieved by defining groups along root-to-leaf paths:
$$\Omega_H(\boldsymbol{\beta}) = \sum_{g \in \text{paths}} \|\boldsymbol{\beta}_g\|_2$$
where each path group g contains a feature and all its ancestors.
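To make the path-group construction concrete, here is a brief sketch assuming the hierarchy is encoded as a parent-pointer list with −1 marking the root (the function name `path_groups` is illustrative):

```python
def path_groups(parent):
    """Build hierarchical path groups from a parent-pointer tree.

    parent : list where parent[j] is the index of j's parent, or -1 for a root.
    Returns one group per feature: the feature itself plus all its ancestors.
    """
    groups = []
    for j in range(len(parent)):
        group, node = [], j
        while node != -1:          # walk up from the feature to the root
            group.append(node)
            node = parent[node]
        groups.append(sorted(group))
    return groups

# Example: 0 is the root, 1 and 2 are its children, 3 is a child of 1
print(path_groups([-1, 0, 0, 1]))
# [[0], [0, 1], [0, 2], [0, 1, 3]]
```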
| Domain | Hierarchy | Constraint Effect |
|---|---|---|
| Text classification | Word → Phrase → Sentence | Larger units selected before constituents |
| Medical diagnosis | Symptom category → Specific symptom | General before specific |
| Marketing | Customer segment → Sub-segment | Broad targeting before micro-targeting |
| Polynomial features | Linear → Quadratic → Cubic | Main effects before interactions |
Two flavors of hierarchical constraints exist:
- **Strong hierarchy:** the parent must be nonzero for a child to be nonzero.
- **Weak hierarchy:** the parent can be zero while a child is nonzero, but the penalty discourages this.
The strong hierarchy constraint is non-convex in general, but can be approximated through carefully designed group penalties.
Standard Group Lasso assumes non-overlapping groups, but real structures often have overlapping memberships: a gene can belong to several biological pathways, for example, and a word feature can appear in several topic groups.
For overlapping groups G₁, G₂, ..., G_K, introduce latent variables v⁽ᵍ⁾ for each group:
$$\min_{\boldsymbol{\beta}, \{\mathbf{v}^{(g)}\}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{g=1}^{K}\|\mathbf{v}^{(g)}\|_2$$

$$\text{subject to } \beta_j = \sum_{g:\, j \in G_g} v_j^{(g)}$$
Each coefficient βⱼ is the sum of contributions from all groups it belongs to. The penalty encourages entire groups of latent variables to be zero.
Overlapping groups require more sophisticated optimization. The proximal operator no longer has a closed form, typically requiring iterative inner loops. For large-scale problems, consider approximations or specialized algorithms like ADMM with smart splitting.
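One practical workaround, following the latent-variable view above, is the column-duplication trick: give each group its own copy of its columns so the expanded groups no longer overlap, fit an ordinary Group Lasso on the expanded design, and sum the copies back into β. The sketch below only builds the expansion and the mapping back; the function names are illustrative assumptions, and any standard Group Lasso solver can be applied to the result:

```python
import numpy as np

def expand_overlapping_groups(X, groups):
    """Column-duplication trick for the latent overlapping Group Lasso.

    Each group receives its own copy of its columns, so the expanded groups
    are non-overlapping. If v solves the Group Lasso on the expanded design,
    the original beta_j is the sum of v over all copies of column j,
    matching the constraint beta_j = sum_{g: j in G_g} v_j^(g).
    """
    cols, new_groups, col_origin, start = [], [], [], 0
    for g in groups:
        cols.append(X[:, g])                            # copy this group's columns
        new_groups.append(list(range(start, start + len(g))))
        col_origin.extend(g)                            # remember original indices
        start += len(g)
    X_expanded = np.hstack(cols)
    return X_expanded, new_groups, np.array(col_origin)

def collapse_coefficients(v, col_origin, p):
    """Map expanded coefficients back: beta_j = sum of all copies of j."""
    beta = np.zeros(p)
    np.add.at(beta, col_origin, v)
    return beta
```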
Standard Group Lasso has an all-or-nothing property: if a group is selected, all its coefficients are nonzero. But we often want sparsity at both levels: only a few groups selected, and within each selected group only a few features active.
Sparse Group Lasso achieves both by combining Group Lasso with standard Lasso:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha\lambda\|\boldsymbol{\beta}\|_1 + (1-\alpha)\lambda\sum_{g=1}^{G}\sqrt{p_g}\|\boldsymbol{\beta}^{(g)}\|_2$$
where α ∈ [0, 1] balances element-wise and group-wise sparsity.
| α Value | Behavior | Use Case |
|---|---|---|
| α = 0 | Pure Group Lasso | All features in selected groups are relevant |
| α = 0.5 | Balanced | Some group structure, some within-group sparsity |
| α = 1 | Pure Lasso | No group structure, individual selection |
```python
import numpy as np


def sparse_group_soft_threshold(z, lambda1, lambda2):
    """Proximal operator for the sparse group lasso penalty.

    Solves: argmin_v (1/2)||v - z||^2 + lambda1*||v||_1 + lambda2*||v||_2

    Parameters
    ----------
    z : ndarray
        Input vector (one group's coefficients).
    lambda1 : float
        Element-wise L1 penalty weight.
    lambda2 : float
        Group L2 penalty weight.

    Returns
    -------
    v : ndarray
        Proximal solution.
    """
    # Step 1: element-wise soft thresholding (L1 part)
    v = np.sign(z) * np.maximum(np.abs(z) - lambda1, 0.0)

    # Step 2: group soft thresholding (L2 part) applied to the result
    norm_v = np.linalg.norm(v)
    if norm_v > lambda2:
        v = (1.0 - lambda2 / norm_v) * v
    else:
        v = np.zeros_like(z)
    return v


def fit_sparse_group_lasso(X, y, groups, lambda_param, alpha=0.5,
                           max_iter=1000, tol=1e-6):
    """Fit Sparse Group Lasso via block-wise proximal coordinate descent.

    Parameters
    ----------
    X : ndarray, shape (n, p)
        Design matrix.
    y : ndarray, shape (n,)
        Response vector.
    groups : list of lists
        Indices of the features in each (non-overlapping) group.
    lambda_param : float
        Overall regularization strength.
    alpha : float
        Balance between L1 and group L2 penalties (0 = pure group, 1 = pure lasso).
    max_iter : int
        Maximum number of passes over the groups.
    tol : float
        Convergence tolerance on the coefficient change.

    Returns
    -------
    beta : ndarray, shape (p,)
        Fitted coefficients.
    """
    n, p = X.shape
    beta = np.zeros(p)

    group_sizes = [len(g) for g in groups]
    # Per-group Lipschitz constants of the smooth loss: ||X_g||_2^2 / n
    group_lipschitz = [np.linalg.norm(X[:, g], 2) ** 2 / n for g in groups]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for g_idx, g in enumerate(groups):
            L_g = group_lipschitz[g_idx]
            if L_g == 0:
                continue  # all-zero columns: leave these coefficients at zero

            # Gradient of the smooth loss restricted to this group
            residual = y - X @ beta
            grad_g = -X[:, g].T @ residual / n

            # Proximal gradient step for this block, with step size 1 / L_g
            z_g = beta[g] - grad_g / L_g
            lambda1 = alpha * lambda_param / L_g
            lambda2 = (1 - alpha) * lambda_param * np.sqrt(group_sizes[g_idx]) / L_g
            beta[g] = sparse_group_soft_threshold(z_g, lambda1, lambda2)

        if np.linalg.norm(beta - beta_old) < tol:
            print(f"Converged at iteration {iteration}")
            break

    return beta
```

Structured regularization follows a general pattern:
$$\min_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}) + \Omega(\boldsymbol{\beta})$$
where Ω(β) encodes the structural prior. To design a custom penalty, start from the structure you want solutions to respect and assemble Ω from a few standard ingredients (a generic solver sketch follows the list):
- **Norm combinations:** mix ℓ1, ℓ2, and group norms over chosen index sets, as Elastic Net, Group Lasso, and Sparse Group Lasso do, to trade off element-wise and group-wise sparsity.
- **Linear transformations:** penalize a transformed coefficient vector, for example ‖Dβ‖₁ for a difference matrix D as in the Fused Lasso, or the quadratic βᵀLβ for a graph Laplacian L.
- **Weighted penalties:** assign feature- or group-specific weights, as in the Adaptive Lasso or the √p_g scaling used by Group Lasso, to reflect differing prior confidence in each feature.
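As a design sketch rather than a finished implementation, any penalty whose proximal operator you can compute drops into a generic proximal-gradient loop. The names `fit_structured` and `weighted_l1_prox` below are assumptions for illustration, with a feature-weighted ℓ1 penalty standing in for a custom Ω:

```python
import numpy as np

def fit_structured(X, y, prox_penalty, n_iter=500):
    """Generic proximal gradient loop: smooth squared loss plus a custom penalty.

    prox_penalty(z, step) must return argmin_v (1/2)||v - z||^2 + step*Omega(v).
    """
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X.T @ X / n, 2)   # 1 / Lipschitz constant of the loss
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta = prox_penalty(beta - step * grad, step)
    return beta

# Example custom penalty: weighted L1, Omega(beta) = lam * sum_j w_j * |beta_j|
def weighted_l1_prox(weights, lam):
    def prox(z, step):
        return np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return prox

# Usage sketch: down-weight the first feature, penalize the others normally
# beta = fit_structured(X, y, weighted_l1_prox(np.array([0.1, 1.0, 1.0]), lam=0.5))
```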
Use structured regularization when: (1) Genuine prior knowledge exists about feature relationships, (2) Interpretability of the structure is important, (3) Standard methods fail to exploit known patterns, (4) The structure is reliably known (incorrect structure can hurt performance).
Congratulations! You've completed Module 6: Advanced Regularization. You now possess a comprehensive understanding of regularization beyond basic Ridge and Lasso—from Group Lasso and Fused Lasso through Bridge Regression, non-convex penalties, and structured regularization. These tools enable you to tackle complex real-world problems where domain structure matters.