Many real-world problems involve data with natural ordering—time series, genomic positions along a chromosome, spatial coordinates in images, or spectral wavelengths. In these settings, we often expect the underlying signal to be piecewise constant or smoothly varying: adjacent measurements should have similar values, with occasional abrupt changes at meaningful boundaries.
Standard Lasso ignores this structure entirely—it treats coefficient β₃ as unrelated to β₂ and β₄. But when features are ordered, this independence assumption wastes valuable information.
Fused Lasso (also called the Total Variation penalty in signal processing) addresses this by penalizing differences between consecutive coefficients:
$$\|\boldsymbol{\beta}\|_{TV} = \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|$$
This penalty encourages adjacent coefficients to be equal, naturally producing piecewise constant solutions—exactly what we need for detecting change points in time series, identifying copy number variations in genomics, or denoising step functions.
By the end of this page, you will understand the Fused Lasso formulation, its dual representation, specialized optimization algorithms, and applications to signal denoising, change point detection, and image processing. You'll recognize when ordered structure in your features calls for fusion penalties.
The Fused Lasso combines a standard sparsity penalty with a fusion penalty on consecutive differences:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=2}^{p}|\beta_j - \beta_{j-1}|$$
The three components serve distinct purposes:
- The squared-error loss measures fidelity to the observed data.
- The λ₁ term is the usual Lasso penalty, shrinking individual coefficients toward zero and producing sparsity.
- The λ₂ term is the fusion penalty, shrinking consecutive differences toward zero and producing piecewise constancy.
The two regularization parameters (λ₁, λ₂) offer flexibility: λ₁ controls how many coefficients are exactly zero, while λ₂ controls how many distinct blocks the solution contains. Setting λ₁ = 0 leaves a pure fusion (total variation) penalty, while setting λ₂ = 0 recovers the standard Lasso.
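To make the roles of the two penalties concrete, here is a small sketch (the function name and example values are illustrative, not from the text) that evaluates the three terms of the objective for a given coefficient vector:

```python
import numpy as np

def fused_lasso_objective(y, X, beta, lam1, lam2):
    """Evaluate the Fused Lasso objective: fit + sparsity + fusion terms."""
    n = len(y)
    fit = 0.5 / n * np.sum((y - X @ beta) ** 2)      # squared-error loss
    sparsity = lam1 * np.sum(np.abs(beta))           # lambda_1 * ||beta||_1
    fusion = lam2 * np.sum(np.abs(np.diff(beta)))    # lambda_2 * sum_j |beta_j - beta_{j-1}|
    return fit + sparsity + fusion

# A piecewise constant coefficient vector incurs a small fusion penalty:
beta = np.array([0.0, 0.0, 2.0, 2.0, 2.0, -1.0])
print(np.sum(np.abs(np.diff(beta))))  # TV(beta) = 0 + 2 + 0 + 0 + 3 = 5.0
```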
Fused Lasso is particularly natural for the signal denoising problem where X = I (identity):
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=2}^{p}|\beta_j - \beta_{j-1}|$$
Here, y represents a noisy observation of an unknown piecewise constant signal β. The fusion penalty recovers the underlying step structure while smoothing away noise.
The term 'Total Variation' comes from measuring the total amount of change in a signal: TV(β) = Σ|βⱼ - βⱼ₋₁|. Minimizing TV while fitting data produces piecewise constant reconstructions—the signal changes only where the data strongly supports a change.
The fusion penalty can be elegantly expressed using a difference matrix D ∈ ℝ⁽ᵖ⁻¹⁾ˣᵖ:
$$\mathbf{D} = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}$$
With this matrix, the fusion penalty becomes:
$$\sum_{j=2}^{p}|\beta_j - \beta_{j-1}| = \|\mathbf{D}\boldsymbol{\beta}\|_1$$
And the Fused Lasso objective takes the compact form:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2\|\mathbf{D}\boldsymbol{\beta}\|_1$$
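A quick numerical check of this identity, using a small illustrative helper to build D:

```python
import numpy as np

def first_difference_matrix(p):
    """Build the (p-1) x p first-difference matrix D with rows (..., -1, 1, ...)."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

beta = np.array([1.0, 1.0, 4.0, 4.0, 0.0])
D = first_difference_matrix(len(beta))
# D @ beta gives the consecutive differences, so its L1 norm is the fusion penalty
assert np.allclose(D @ beta, np.diff(beta))
print(np.sum(np.abs(D @ beta)))  # 7.0 = |0| + |3| + |0| + |-4|
```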
This matrix representation reveals Fused Lasso as a special case of the Generalized Lasso:
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\mathbf{D}\boldsymbol{\beta}\|_1$$
Different choices of D yield different regularizers:
| Matrix D | Penalty Type | Effect |
|---|---|---|
| Identity I | Standard Lasso | Element-wise sparsity |
| First differences | Fused Lasso / TV-1D | Piecewise constant |
| Second differences | Trend filtering | Piecewise linear |
| Graph Laplacian | Graph-based fusion | Smooth over graph structure |
This unifying framework extends Fused Lasso concepts to arbitrary penalty matrices.
Using second-order differences D² produces piecewise linear solutions (trend filtering of order 1). Third-order differences produce piecewise quadratic solutions. The order of the difference operator controls the smoothness class of the fitted signal.
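If you want to experiment, difference operators of any order can be built directly from an identity matrix. A short sketch (the helper name is illustrative):

```python
import numpy as np

def difference_operator(p, order=1):
    """k-th order difference matrix of shape (p - order, p)."""
    return np.diff(np.eye(p), n=order, axis=0)

p = 6
D1 = difference_operator(p, order=1)  # Fused Lasso / TV-1D: piecewise constant fits
D2 = difference_operator(p, order=2)  # linear trend filtering: piecewise linear fits
# The second-difference operator equals the first-difference operator applied twice
assert np.allclose(D2, difference_operator(p - 1, order=1) @ D1)
```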
The defining characteristic of Fused Lasso solutions is their piecewise constant nature. The solution β* partitions the coefficients into contiguous blocks:
$$\beta^*_1 = \beta^*_2 = \ldots = \beta^*_{k_1} \neq \beta^*_{k_1+1} = \ldots = \beta^*_{k_2} \neq \ldots$$
Within each block, all coefficients share the same value. The number of distinct blocks decreases as λ₂ increases; in the limit λ₂ → ∞, all coefficients fuse into a single constant block.
An important result: the degrees of freedom of the Fused Lasso solution (with λ₁ = 0) equals the number of distinct blocks, not the number of nonzero coefficients. This is crucial for model selection criteria like AIC and BIC.
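As a sketch of how this can be used in practice (assuming Gaussian errors, the denoising setting, and a fitted `beta_hat`; the function name, BIC form, and tolerance are illustrative choices), BIC can be computed by counting blocks:

```python
import numpy as np

def fused_lasso_bic(y, beta_hat, tol=1e-8):
    """BIC-style score where df = number of distinct constant blocks in the fit."""
    n = len(y)
    rss = np.sum((y - beta_hat) ** 2)
    df = 1 + np.count_nonzero(np.abs(np.diff(beta_hat)) > tol)  # blocks = jumps + 1
    return n * np.log(rss / n) + np.log(n) * df
```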
| λ₁ (Sparsity) | λ₂ (Fusion) | Solution Characteristics |
|---|---|---|
| Low | Low | Many distinct nonzero values, flexible fit |
| Low | High | Few distinct values, large constant blocks |
| High | Low | Sparse (many zeros), remaining values distinct |
| High | High | Sparse with fused nonzero regions |
When Dβ has a nonzero entry at position j, this indicates a change point where βⱼ₊₁ ≠ βⱼ. Thus, Fused Lasso simultaneously:
- estimates the coefficient values within each constant block, and
- detects the locations of the change points that separate the blocks.
This dual role—estimation and detection—makes Fused Lasso powerful for structural discovery in ordered data.
Like standard Lasso, Fused Lasso introduces bias. The fusion penalty shrinks differences toward zero, so estimated change point magnitudes are attenuated. For precise jump size estimation, consider refitting without penalty after detecting change point locations.
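For the denoising case (X = I), that two-stage recipe might look like the following sketch (names and the tolerance are illustrative): detect jumps from the fused fit, then refit each block by its sample mean so the estimated jump sizes are no longer shrunk.

```python
import numpy as np

def refit_block_means(y, beta_hat, tol=1e-8):
    """Detect change points from a fused fit and refit each block by its mean."""
    jumps = np.flatnonzero(np.abs(np.diff(beta_hat)) > tol)   # change point locations
    boundaries = np.concatenate(([0], jumps + 1, [len(y)]))   # block start indices + end
    refit = np.empty_like(y, dtype=float)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        refit[start:end] = y[start:end].mean()                # unshrunk block estimates
    return refit, jumps
```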
For the pure denoising case (X = I, λ₁ = 0), remarkably efficient specialized algorithms exist: a path algorithm computes the exact solution for every value of λ in near-linear O(n log n) time, and direct methods such as the taut string algorithm solve the problem for a single fixed λ in O(n) time. The key insight behind the path algorithm: as λ increases, adjacent coefficients merge in a specific order that can be tracked efficiently.
The path algorithm maintains a partition of coefficients into fused groups. As λ grows, adjacent groups whose fitted values meet merge into a single group; once merged, groups never split, so at most n − 1 merge events occur.
This produces the entire solution path β(λ) for λ ∈ [0, ∞) in O(n log n) time.
```python
import numpy as np


def fused_lasso_1d(y, lam):
    """
    Exact 1D Fused Lasso signal approximation (Total Variation denoising).

    Solves:  min_beta  (1/2) * ||y - beta||^2  +  lam * sum_j |beta_j - beta_{j-1}|

    Implementation follows Condat's direct algorithm (2013), a non-iterative
    method closely related to the taut string approach; it runs in linear
    time for typical signals.

    Parameters
    ----------
    y : ndarray of shape (n,)
        Noisy signal.
    lam : float
        Regularization parameter (lam >= 0).

    Returns
    -------
    beta : ndarray of shape (n,)
        Denoised, piecewise constant signal.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    beta = np.empty(n)
    if n == 0:
        return beta
    if lam <= 0:
        return y.copy()

    # k: current sample; k0: start of the current segment;
    # kminus / kplus: last positions where the running sums umin / umax
    # touched their bounds; (vmin, vmax): candidate values for the segment.
    k = k0 = kminus = kplus = 0
    vmin, vmax = y[0] - lam, y[0] + lam
    umin, umax = lam, -lam

    while True:
        # Handle the right boundary: decide how the final segment(s) end.
        while k == n - 1:
            if umin < 0.0:            # segment value vmin is too high: jump down
                while True:
                    beta[k0] = vmin
                    k0 += 1
                    if k0 > kminus:
                        break
                k = kminus = k0
                vmin = y[k]
                umin = lam
                umax = vmin + lam - vmax
            elif umax > 0.0:          # segment value vmax is too low: jump up
                while True:
                    beta[k0] = vmax
                    k0 += 1
                    if k0 > kplus:
                        break
                k = kplus = k0
                vmax = y[k]
                umax = -lam
                umin = vmax - lam - vmin
            else:                     # final segment found: set it and finish
                beta[k0:k + 1] = vmin + umin / (k - k0 + 1)
                return beta

        umin += y[k + 1] - vmin
        if umin < -lam:               # negative jump: close the segment at vmin
            while True:
                beta[k0] = vmin
                k0 += 1
                if k0 > kminus:
                    break
            k = kminus = kplus = k0
            vmin = y[k]
            vmax = vmin + 2.0 * lam
            umin, umax = lam, -lam
        else:
            umax += y[k + 1] - vmax
            if umax > lam:            # positive jump: close the segment at vmax
                while True:
                    beta[k0] = vmax
                    k0 += 1
                    if k0 > kplus:
                        break
                k = kminus = kplus = k0
                vmax = y[k]
                vmin = vmax - 2.0 * lam
                umin, umax = lam, -lam
            else:                     # no jump: extend the current segment
                k += 1
                if umin >= lam:       # lower bound active: raise vmin
                    kminus = k
                    vmin += (umin - lam) / (kminus - k0 + 1)
                    umin = lam
                if umax <= -lam:      # upper bound active: lower vmax
                    kplus = k
                    vmax += (umax + lam) / (kplus - k0 + 1)
                    umax = -lam
```

For the full problem with arbitrary X, the Alternating Direction Method of Multipliers (ADMM) is effective. Introducing the auxiliary variable z = Dβ, we solve:
$$\min_{\boldsymbol{\beta}, \mathbf{z}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2\|\mathbf{z}\|_1 \quad \text{s.t.} \quad \mathbf{D}\boldsymbol{\beta} = \mathbf{z}$$
ADMM alternates three updates:
- β-update: minimize the augmented Lagrangian over β; with λ₁ = 0 this reduces to solving the linear system (XᵀX + ρDᵀD)β = Xᵀy + ρDᵀ(z − u), and with λ₁ > 0 it becomes a Lasso-type subproblem.
- z-update: element-wise soft-thresholding of Dβ + u at level λ₂/ρ.
- Dual update: u ← u + Dβ − z.
Convergence is guaranteed for convex problems, though the rate can be slow without acceleration.
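A minimal ADMM sketch for the Generalized Lasso form with λ₁ = 0 (the penalty parameter ρ, the fixed iteration count, and the function name are illustrative choices; the β-update assumes the system matrix is invertible):

```python
import numpy as np

def generalized_lasso_admm(y, X, D, lam, rho=1.0, n_iter=500):
    """ADMM for min_beta 0.5*||y - X beta||^2 + lam*||D beta||_1."""
    z = np.zeros(D.shape[0])
    u = np.zeros(D.shape[0])
    A = X.T @ X + rho * D.T @ D        # beta-update system matrix (fixed across iterations)
    Xty = X.T @ y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = np.linalg.solve(A, Xty + rho * D.T @ (z - u))   # beta-update
        Dbeta = D @ beta
        v = Dbeta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # z-update (soft-threshold)
        u = u + Dbeta - z                                        # dual update
    return beta
```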
The canonical application: recovering a piecewise constant signal from noisy observations. Given noisy measurements y = β* + ε, Fused Lasso estimates the true signal β* while detecting discontinuities.
In genomics, copy number variations (CNVs) are regions where DNA segments are duplicated or deleted. Along a chromosome, copy number is approximately piecewise constant with jumps at CNV boundaries.
Fused Lasso applied to array CGH or sequencing data segments the noisy probe-level log-ratio profile into regions of approximately constant copy number, with the detected change points marking candidate CNV boundaries.
Extending to 2D images, the Total Variation penalty sums absolute differences in both spatial directions:
$$\|\mathbf{I}\|_{TV} = \sum_{i,j} \left( |I_{i+1,j} - I_{i,j}| + |I_{i,j+1} - I_{i,j}| \right)$$
This produces cartoon-like denoised images with sharp edges but flat regions—ideal for preserving boundaries while removing noise.
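For reference, this anisotropic 2D TV penalty can be computed directly from an image array (a small illustrative sketch):

```python
import numpy as np

def tv_2d(image):
    """Anisotropic total variation: sum of absolute vertical and horizontal differences."""
    return (np.sum(np.abs(np.diff(image, axis=0))) +   # |I[i+1, j] - I[i, j]|
            np.sum(np.abs(np.diff(image, axis=1))))    # |I[i, j+1] - I[i, j]|
```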
You now understand Fused Lasso—the essential tool for regularization in ordered data. Next, we'll explore Bridge Regression, which generalizes the L1 and L2 penalties to the Lq norm family, revealing the geometric continuum between Lasso and Ridge.