Many real-world problems involve data with natural ordering—time series, genomic positions along a chromosome, spatial coordinates in images, or spectral wavelengths. In these settings, we often expect the underlying signal to be piecewise constant or smoothly varying: adjacent measurements should have similar values, with occasional abrupt changes at meaningful boundaries.
Standard Lasso ignores this structure entirely—it treats coefficient β₃ as unrelated to β₂ and β₄. But when features are ordered, this independence assumption wastes valuable information.
Fused Lasso (also called the Total Variation penalty in signal processing) addresses this by penalizing differences between consecutive coefficients:
$$\|\boldsymbol{\beta}\|_{TV} = \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|$$
This penalty encourages adjacent coefficients to be equal, naturally producing piecewise constant solutions—exactly what we need for detecting change points in time series, identifying copy number variations in genomics, or denoising step functions.
By the end of this page, you will understand the Fused Lasso formulation, its dual representation, specialized optimization algorithms, and applications to signal denoising, change point detection, and image processing. You'll recognize when ordered structure in your features calls for fusion penalties.
The Fused Lasso combines a standard sparsity penalty with a fusion penalty on consecutive differences:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=2}^{p}|\beta_j - \beta_{j-1}|$$
The three components serve distinct purposes:
- The squared-error loss measures fidelity to the observed data.
- The λ₁ term is the usual Lasso penalty, shrinking individual coefficients toward zero and producing sparsity.
- The λ₂ term is the fusion penalty, shrinking consecutive differences toward zero and producing piecewise constancy.
The two regularization parameters (λ₁, λ₂) offer flexibility: λ₁ controls how many coefficients are exactly zero, while λ₂ controls how many distinct blocks the solution contains. Setting λ₁ = 0 leaves a pure fusion (total variation) penalty, while setting λ₂ = 0 recovers the standard Lasso.
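To make the roles of the two penalties concrete, here is a small sketch (the function name and example values are illustrative, not from the text) that evaluates the three terms of the objective for a given coefficient vector:

```python
import numpy as np

def fused_lasso_objective(y, X, beta, lam1, lam2):
    """Evaluate the Fused Lasso objective: fit + sparsity + fusion terms."""
    n = len(y)
    fit = 0.5 / n * np.sum((y - X @ beta) ** 2)      # squared-error loss
    sparsity = lam1 * np.sum(np.abs(beta))           # lambda_1 * ||beta||_1
    fusion = lam2 * np.sum(np.abs(np.diff(beta)))    # lambda_2 * sum_j |beta_j - beta_{j-1}|
    return fit + sparsity + fusion

# A piecewise constant coefficient vector incurs a small fusion penalty:
beta = np.array([0.0, 0.0, 2.0, 2.0, 2.0, -1.0])
print(np.sum(np.abs(np.diff(beta))))  # TV(beta) = 0 + 2 + 0 + 0 + 3 = 5.0
```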
Fused Lasso is particularly natural for the signal denoising problem where X = I (identity):
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=2}^{p}|\beta_j - \beta_{j-1}|$$
Here, y represents a noisy observation of an unknown piecewise constant signal β. The fusion penalty recovers the underlying step structure while smoothing away noise.
The term 'Total Variation' comes from measuring the total amount of change in a signal: TV(β) = Σ|βⱼ - βⱼ₋₁|. Minimizing TV while fitting data produces piecewise constant reconstructions—the signal changes only where the data strongly supports a change.
The fusion penalty can be elegantly expressed using a difference matrix D ∈ ℝ⁽ᵖ⁻¹⁾ˣᵖ:
$$\mathbf{D} = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}$$
With this matrix, the fusion penalty becomes:
$$\sum_{j=2}^{p}|\beta_j - \beta_{j-1}| = \|\mathbf{D}\boldsymbol{\beta}\|_1$$
And the Fused Lasso objective takes the compact form:
$$\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2\|\mathbf{D}\boldsymbol{\beta}\|_1$$
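A quick numerical check of this identity, using a small illustrative helper to build D:

```python
import numpy as np

def first_difference_matrix(p):
    """Build the (p-1) x p first-difference matrix D with rows (..., -1, 1, ...)."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

beta = np.array([1.0, 1.0, 4.0, 4.0, 0.0])
D = first_difference_matrix(len(beta))
# D @ beta gives the consecutive differences, so its L1 norm is the fusion penalty
assert np.allclose(D @ beta, np.diff(beta))
print(np.sum(np.abs(D @ beta)))  # 7.0 = |0| + |3| + |0| + |-4|
```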
This matrix representation reveals Fused Lasso as a special case of the Generalized Lasso:
$$\min_{\boldsymbol{\beta}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\mathbf{D}\boldsymbol{\beta}\|_1$$
Different choices of D yield different regularizers:
| Matrix D | Penalty Type | Effect |
|---|---|---|
| Identity I | Standard Lasso | Element-wise sparsity |
| First differences | Fused Lasso / TV-1D | Piecewise constant |
| Second differences | Trend filtering | Piecewise linear |
| Graph Laplacian | Graph-based fusion | Smooth over graph structure |
This unifying framework extends Fused Lasso concepts to arbitrary penalty matrices.
Using second-order differences D² produces piecewise linear solutions (trend filtering of order 1). Third-order differences produce piecewise quadratic solutions. The order of the difference operator controls the smoothness class of the fitted signal.
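If you want to experiment, difference operators of any order can be built directly from an identity matrix. A short sketch (the helper name is illustrative):

```python
import numpy as np

def difference_operator(p, order=1):
    """k-th order difference matrix of shape (p - order, p)."""
    return np.diff(np.eye(p), n=order, axis=0)

p = 6
D1 = difference_operator(p, order=1)  # Fused Lasso / TV-1D: piecewise constant fits
D2 = difference_operator(p, order=2)  # linear trend filtering: piecewise linear fits
# The second-difference operator equals the first-difference operator applied twice
assert np.allclose(D2, difference_operator(p - 1, order=1) @ D1)
```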
The defining characteristic of Fused Lasso solutions is their piecewise constant nature. The solution β* partitions the coefficients into contiguous blocks:
$$\beta^*_1 = \beta^*_2 = \ldots = \beta^*_{k_1} \neq \beta^*_{k_1+1} = \ldots = \beta^*_{k_2} \neq \ldots$$
Within each block, all coefficients share the same value. The number of distinct blocks decreases as λ₂ increases; in the limit λ₂ → ∞, all coefficients fuse into a single constant block.
An important result: the degrees of freedom of the Fused Lasso solution (with λ₁ = 0) equals the number of distinct blocks, not the number of nonzero coefficients. This is crucial for model selection criteria like AIC and BIC.
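As a sketch of how this can be used in practice (assuming Gaussian errors, the denoising setting, and a fitted `beta_hat`; the function name, BIC form, and tolerance are illustrative choices), BIC can be computed by counting blocks:

```python
import numpy as np

def fused_lasso_bic(y, beta_hat, tol=1e-8):
    """BIC-style score where df = number of distinct constant blocks in the fit."""
    n = len(y)
    rss = np.sum((y - beta_hat) ** 2)
    df = 1 + np.count_nonzero(np.abs(np.diff(beta_hat)) > tol)  # blocks = jumps + 1
    return n * np.log(rss / n) + np.log(n) * df
```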
| λ₁ (Sparsity) | λ₂ (Fusion) | Solution Characteristics |
|---|---|---|
| Low | Low | Many distinct nonzero values, flexible fit |
| Low | High | Few distinct values, large constant blocks |
| High | Low | Sparse (many zeros), remaining values distinct |
| High | High | Sparse with fused nonzero regions |
When Dβ has a nonzero entry at position j, this indicates a change point where βⱼ₊₁ ≠ βⱼ. Thus, Fused Lasso simultaneously:
- estimates the coefficient values within each constant block, and
- detects the locations of the change points that separate the blocks.
This dual role—estimation and detection—makes Fused Lasso powerful for structural discovery in ordered data.
Like standard Lasso, Fused Lasso introduces bias. The fusion penalty shrinks differences toward zero, so estimated change point magnitudes are attenuated. For precise jump size estimation, consider refitting without penalty after detecting change point locations.
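For the denoising case (X = I), that two-stage recipe might look like the following sketch (names and the tolerance are illustrative): detect jumps from the fused fit, then refit each block by its sample mean so the estimated jump sizes are no longer shrunk.

```python
import numpy as np

def refit_block_means(y, beta_hat, tol=1e-8):
    """Detect change points from a fused fit and refit each block by its mean."""
    jumps = np.flatnonzero(np.abs(np.diff(beta_hat)) > tol)   # change point locations
    boundaries = np.concatenate(([0], jumps + 1, [len(y)]))   # block start indices + end
    refit = np.empty_like(y, dtype=float)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        refit[start:end] = y[start:end].mean()                # unshrunk block estimates
    return refit, jumps
```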
For the pure denoising case (X = I, λ₁ = 0), remarkably efficient specialized algorithms exist: a path algorithm computes the exact solution for every value of λ in near-linear O(n log n) time, and direct methods such as the taut string algorithm solve the problem for a single fixed λ in O(n) time. The key insight behind the path algorithm: as λ increases, adjacent coefficients merge in a specific order that can be tracked efficiently.
The path algorithm maintains a partition of coefficients into fused groups. As λ grows, adjacent groups whose fitted values meet merge into a single group; once merged, groups never split, so at most n − 1 merge events occur.
This produces the entire solution path β(λ) for λ ∈ [0, ∞) in O(n log n) time.
```python
import numpy as np


def fused_lasso_1d(y, lam):
    """
    Exact 1D Fused Lasso signal approximation (Total Variation denoising).

    Solves:  min_beta  (1/2) * ||y - beta||^2  +  lam * sum_j |beta_j - beta_{j-1}|

    Implementation follows Condat's direct algorithm (2013), a non-iterative
    method closely related to the taut string approach; it runs in linear
    time for typical signals.

    Parameters
    ----------
    y : ndarray of shape (n,)
        Noisy signal.
    lam : float
        Regularization parameter (lam >= 0).

    Returns
    -------
    beta : ndarray of shape (n,)
        Denoised, piecewise constant signal.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    beta = np.empty(n)
    if n == 0:
        return beta
    if lam <= 0:
        return y.copy()

    # k: current sample; k0: start of the current segment;
    # kminus / kplus: last positions where the running sums umin / umax
    # touched their bounds; (vmin, vmax): candidate values for the segment.
    k = k0 = kminus = kplus = 0
    vmin, vmax = y[0] - lam, y[0] + lam
    umin, umax = lam, -lam

    while True:
        # Handle the right boundary: decide how the final segment(s) end.
        while k == n - 1:
            if umin < 0.0:            # segment value vmin is too high: jump down
                while True:
                    beta[k0] = vmin
                    k0 += 1
                    if k0 > kminus:
                        break
                k = kminus = k0
                vmin = y[k]
                umin = lam
                umax = vmin + lam - vmax
            elif umax > 0.0:          # segment value vmax is too low: jump up
                while True:
                    beta[k0] = vmax
                    k0 += 1
                    if k0 > kplus:
                        break
                k = kplus = k0
                vmax = y[k]
                umax = -lam
                umin = vmax - lam - vmin
            else:                     # final segment found: set it and finish
                beta[k0:k + 1] = vmin + umin / (k - k0 + 1)
                return beta

        umin += y[k + 1] - vmin
        if umin < -lam:               # negative jump: close the segment at vmin
            while True:
                beta[k0] = vmin
                k0 += 1
                if k0 > kminus:
                    break
            k = kminus = kplus = k0
            vmin = y[k]
            vmax = vmin + 2.0 * lam
            umin, umax = lam, -lam
        else:
            umax += y[k + 1] - vmax
            if umax > lam:            # positive jump: close the segment at vmax
                while True:
                    beta[k0] = vmax
                    k0 += 1
                    if k0 > kplus:
                        break
                k = kminus = kplus = k0
                vmax = y[k]
                vmin = vmax - 2.0 * lam
                umin, umax = lam, -lam
            else:                     # no jump: extend the current segment
                k += 1
                if umin >= lam:       # lower bound active: raise vmin
                    kminus = k
                    vmin += (umin - lam) / (kminus - k0 + 1)
                    umin = lam
                if umax <= -lam:      # upper bound active: lower vmax
                    kplus = k
                    vmax += (umax + lam) / (kplus - k0 + 1)
                    umax = -lam
```

For the full problem with arbitrary X, the Alternating Direction Method of Multipliers (ADMM) is effective. Introducing the auxiliary variable z = Dβ, we solve:
$$\min_{\boldsymbol{\beta}, \mathbf{z}} \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1\|\boldsymbol{\beta}\|_1 + \lambda_2\|\mathbf{z}\|_1 \quad \text{s.t.} \quad \mathbf{D}\boldsymbol{\beta} = \mathbf{z}$$
ADMM alternates three updates:
- β-update: minimize the augmented Lagrangian over β; with λ₁ = 0 this reduces to solving the linear system (XᵀX + ρDᵀD)β = Xᵀy + ρDᵀ(z − u), and with λ₁ > 0 it becomes a Lasso-type subproblem.
- z-update: element-wise soft-thresholding of Dβ + u at level λ₂/ρ.
- Dual update: u ← u + Dβ − z.
Convergence is guaranteed for convex problems, though the rate can be slow without acceleration.
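A minimal ADMM sketch for the Generalized Lasso form with λ₁ = 0 (the penalty parameter ρ, the fixed iteration count, and the function name are illustrative choices; the β-update assumes the system matrix is invertible):

```python
import numpy as np

def generalized_lasso_admm(y, X, D, lam, rho=1.0, n_iter=500):
    """ADMM for min_beta 0.5*||y - X beta||^2 + lam*||D beta||_1."""
    z = np.zeros(D.shape[0])
    u = np.zeros(D.shape[0])
    A = X.T @ X + rho * D.T @ D        # beta-update system matrix (fixed across iterations)
    Xty = X.T @ y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = np.linalg.solve(A, Xty + rho * D.T @ (z - u))   # beta-update
        Dbeta = D @ beta
        v = Dbeta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # z-update (soft-threshold)
        u = u + Dbeta - z                                        # dual update
    return beta
```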
The canonical application: recovering a piecewise constant signal from noisy observations. Given noisy measurements y = β* + ε, Fused Lasso estimates the true signal β* while detecting discontinuities.
In genomics, copy number variations (CNVs) are regions where DNA segments are duplicated or deleted. Along a chromosome, copy number is approximately piecewise constant with jumps at CNV boundaries.
Fused Lasso applied to array CGH or sequencing data segments the noisy probe-level log-ratio profile into regions of approximately constant copy number, with the detected change points marking candidate CNV boundaries.
Extending to 2D images, the Total Variation penalty sums absolute differences in both spatial directions:
$$\|\mathbf{I}\|_{TV} = \sum_{i,j} \left( |I_{i+1,j} - I_{i,j}| + |I_{i,j+1} - I_{i,j}| \right)$$
This produces cartoon-like denoised images with sharp edges but flat regions—ideal for preserving boundaries while removing noise.
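For reference, this anisotropic 2D TV penalty can be computed directly from an image array (a small illustrative sketch):

```python
import numpy as np

def tv_2d(image):
    """Anisotropic total variation: sum of absolute vertical and horizontal differences."""
    return (np.sum(np.abs(np.diff(image, axis=0))) +   # |I[i+1, j] - I[i, j]|
            np.sum(np.abs(np.diff(image, axis=1))))    # |I[i, j+1] - I[i, j]|
```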
You now understand Fused Lasso—the essential tool for regularization in ordered data. Next, we'll explore Bridge Regression, which generalizes the L1 and L2 penalties to the Lq norm family, revealing the geometric continuum between Lasso and Ridge.