Ordinary Least Squares (OLS) regression is one of the most elegant and widely-used statistical tools ever developed. Its mathematical properties are beautiful: closed-form solutions, unbiased estimates, and optimal efficiency under Gaussian errors. Yet in the trenches of real-world data analysis, OLS harbors a dangerous vulnerability—a single outlier can completely corrupt your model.
Consider fitting a simple linear regression to predict housing prices. 99 homes in your dataset follow a clean linear relationship, but one data entry error records a $500,000 home as $50,000,000. With OLS, that single point doesn't just influence your model—it dominates it. The resulting regression line will twist dramatically toward the outlier, producing wildly inaccurate predictions for every other observation.
This vulnerability isn't a minor inconvenience. In real-world applications—fraud detection, medical diagnostics, autonomous systems, financial modeling—outliers are not exceptions but certainties. Sensor malfunctions, data entry errors, genuine extreme events, and adversarial inputs all generate observations that violate the assumptions underlying OLS.
Robust regression methods address this fundamental problem. They produce reliable estimates even when a fraction of the data is corrupted or comes from a different distribution entirely. At the heart of modern robust regression lies a deceptively simple idea: replace the squared loss with something less sensitive to large residuals.
The Huber loss function is the canonical solution—a hybrid loss that behaves like squared error for small residuals and like absolute error for large ones. This page provides an exhaustive treatment of Huber loss, from its theoretical foundations to practical implementation considerations.
By the end of this page, you will understand why squared loss is sensitive to outliers, how Huber loss provides a mathematically principled compromise between L1 and L2 losses, the complete derivation and properties of Huber loss, and how to implement and tune Huber regression in practice.
To understand why Huber loss exists, we must first deeply understand what goes wrong with ordinary squared loss. The OLS objective minimizes the sum of squared residuals:
$$\mathcal{L}_{\text{OLS}}(\boldsymbol{\beta}) = \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = \sum_{i=1}^{n} r_i^2$$
where $r_i = y_i - \mathbf{x}_i^\top \boldsymbol{\beta}$ is the residual for observation $i$.
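As a concrete reference point, the OLS objective above has the closed-form normal-equations solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. A minimal sketch with synthetic data (all variable names here are illustrative):

```python
import numpy as np

# Minimal OLS sketch: minimize the sum of squared residuals in closed form
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one feature
y = 2 + 3 * X[:, 1] + rng.normal(scale=0.5, size=50)      # clean linear data

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: (X'X) beta = X'y
residuals = y - X @ beta_hat

print("OLS coefficients:", beta_hat)                     # close to [2, 3]
print("Sum of squared residuals:", np.sum(residuals**2))
```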
The quadratic penalty problem:
When we square the residuals, we impose a penalty that grows quadratically with the size of the error. This seems reasonable at first—larger errors should receive larger penalties. But the quadratic growth rate creates severe problems:
| Residual Size | Squared Penalty | Interpretation |
|---|---|---|
| $r = 1$ | $r^2 = 1$ | Normal observation, unit penalty |
| $r = 2$ | $r^2 = 4$ | 2x error → 4x penalty |
| $r = 10$ | $r^2 = 100$ | 10x error → 100x penalty |
| $r = 100$ | $r^2 = 10,000$ | One outlier dominates entire objective |
Mathematical analysis of influence:
To understand outlier sensitivity rigorously, we examine the gradient of the squared loss with respect to the prediction:
$$\frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = -2(y - \hat{y}) = -2r$$
The gradient is linear in the residual: an observation with a residual of 100 contributes a gradient 100 times larger than one with a residual of 1.
In optimization terms, outliers create extreme gradients that pull the solution toward them. The model sacrifices fit quality on many observations to reduce error on a few extreme ones.
The statistical perspective:
Squared loss is optimal when errors follow a Gaussian distribution—this is the maximum likelihood estimator under $\epsilon \sim \mathcal{N}(0, \sigma^2)$. But real-world error distributions often have heavy tails—more extreme values than a Gaussian would predict. Common heavy-tailed distributions include the Student's t with few degrees of freedom, the Laplace, the Cauchy, and contaminated Gaussian mixtures.
Under heavy-tailed errors, OLS is no longer optimal—and can be catastrophically inefficient. The sample mean (equivalent to intercept-only OLS) has infinite variance for Cauchy-distributed data, meaning more data doesn't help improve the estimate.
OLS has a breakdown point of 1/n—meaning a single observation can make the estimator arbitrarily bad. As n grows, this fraction shrinks, but the fundamental vulnerability remains. One adversarially-placed point can always corrupt the model completely.
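To make the 1/n breakdown point concrete, the sketch below (synthetic data, illustrative corruption sizes) corrupts a single response value and watches the OLS slope drift arbitrarily far from the true value of 3:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 2 + 3 * x + 0.5 * rng.normal(size=n)   # true slope is 3

for corruption in [0.0, 1e2, 1e4, 1e6]:
    y_bad = y.copy()
    y_bad[0] += corruption                              # corrupt one observation
    slope = np.linalg.solve(X.T @ X, X.T @ y_bad)[1]    # refit OLS
    print(f"corruption = {corruption:9.0e}   OLS slope = {slope:12.3f}")

# The slope grows without bound with the single corrupted point: that is the
# 1/n breakdown point in action.
```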
Before introducing Huber loss, let's examine the natural alternative to squared loss: absolute loss (L1 loss, also called Least Absolute Deviations or LAD).
$$\mathcal{L}_{\text{LAD}}(\boldsymbol{\beta}) = \sum_{i=1}^{n} |y_i - \mathbf{x}_i^\top \boldsymbol{\beta}| = \sum_{i=1}^{n} |r_i|$$
The L1 loss penalizes residuals linearly rather than quadratically:
| Residual | L1 Penalty $\lvert r \rvert$ | L2 Penalty $r^2$ | L2/L1 Ratio |
|---|---|---|---|
| $r = 0.1$ | 0.1 | 0.01 | 0.1x (L2 gentler) |
| $r = 1$ | 1 | 1 | 1x (equal) |
| $r = 10$ | 10 | 100 | 10x (L2 harsher) |
| $r = 100$ | 100 | 10,000 | 100x (L2 dominates) |
Properties of L1 loss:
The L1 loss has remarkable robustness properties:
Bounded influence: The gradient of L1 loss is $\frac{\partial |r|}{\partial \hat{y}} = -\text{sign}(r)$, which has magnitude 1 regardless of residual size. Outliers cannot exert disproportionate influence.
Median regression: The L1 estimator finds the conditional median rather than the conditional mean. The median is far more robust to outliers than the mean.
Breakdown point: L1 regression can tolerate nearly 50% outliers before breaking down (exact breakdown point depends on design matrix geometry).
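A toy location example (synthetic numbers, comparing mean vs. median rather than a full LAD fit) makes the median's robustness concrete before we turn to the drawbacks:

```python
import numpy as np

rng = np.random.default_rng(2)
clean = rng.normal(loc=10.0, scale=1.0, size=95)   # 95 well-behaved observations
gross = np.full(5, 1_000.0)                        # 5% gross outliers
data = np.concatenate([clean, gross])

print(f"Mean   (what intercept-only OLS estimates): {data.mean():8.2f}")      # pulled toward 1000
print(f"Median (what intercept-only LAD estimates): {np.median(data):8.2f}")  # stays near 10
```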
But L1 has drawbacks:
Non-differentiability: The absolute value function has a kink at zero: $|r|$ is not differentiable at $r=0$. This complicates optimization—standard gradient descent doesn't apply directly.
Efficiency loss: When errors truly are Gaussian, L1 is less efficient than L2. Its asymptotic relative efficiency is $2/\pi \approx 64\%$, so you need roughly 50–60% more data to achieve the same precision with L1 as with L2 under Gaussian errors.
Non-unique solutions: L1 solutions may not be unique when the data has certain configurations—a technical nuisance for interpretation.
We face a tradeoff: L2 loss is efficient and smooth but sensitive to outliers. L1 loss is robust and bounded-influence but non-smooth and less efficient. Huber loss provides a principled middle ground—smooth everywhere while still limiting outlier influence.
Peter Huber introduced his eponymous loss function in 1964 as a way to achieve the "best of both worlds"—efficiency of L2 near zero while maintaining robustness of L1 for large residuals.
Definition:
The Huber loss function $\rho_\delta(r)$ is defined as:
$$\rho_\delta(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \leq \delta \\ \delta |r| - \frac{1}{2}\delta^2 & \text{if } |r| > \delta \end{cases}$$
where $\delta > 0$ is the threshold parameter (also called the tuning constant or transition point).
Intuition:
Residuals with $|r| \leq \delta$ are treated as ordinary noise and penalized quadratically, exactly like OLS; residuals with $|r| > \delta$ are treated as potential outliers and penalized only linearly. The transition point $\delta$ thus determines what counts as "small" vs "large".
The linear portion is written as $\delta|r| - \frac{1}{2}\delta^2$ rather than simply $|r|$ to ensure the function is continuous at $r = \pm\delta$. At the transition point: $\frac{1}{2}\delta^2 = \delta \cdot \delta - \frac{1}{2}\delta^2$, confirming continuity.
The derivative (influence function):
The derivative of Huber loss determines how much each observation influences the parameter estimates:
$$\psi_\delta(r) = \frac{\partial \rho_\delta(r)}{\partial r} = \begin{cases} r & \text{if } |r| \leq \delta \\ \delta \cdot \text{sign}(r) & \text{if } |r| > \delta \end{cases}$$
This is called the Huber psi-function or influence function. Key observations:
Continuous at transition: At $r = \delta$: from below, $\psi = \delta$; from above, $\psi = \delta \cdot 1 = \delta$. The function is continuous.
Bounded influence: For $|r| > \delta$, the influence is capped at $\pm\delta$. No matter how extreme an outlier, its maximum influence is bounded.
Smooth near zero: Unlike L1 loss, Huber loss is differentiable everywhere, including at zero. This enables standard gradient-based optimization.
The second derivative:
$$\psi'_\delta(r) = \frac{\partial^2 \rho_\delta(r)}{\partial r^2} = \begin{cases} 1 & \text{if } |r| < \delta \\ 0 & \text{if } |r| > \delta \end{cases}$$
The second derivative is discontinuous at $\pm\delta$, so Huber loss is $C^1$ but not $C^2$ (once but not twice continuously differentiable). This is sufficient for most optimization algorithms.
```python
import numpy as np
import matplotlib.pyplot as plt

def huber_loss(r, delta=1.0):
    """
    Compute Huber loss for residuals r.

    Parameters:
    -----------
    r : array-like
        Residuals (y - y_pred)
    delta : float
        Threshold parameter controlling transition from L2 to L1

    Returns:
    --------
    loss : array-like
        Huber loss values
    """
    abs_r = np.abs(r)
    quadratic_mask = abs_r <= delta
    loss = np.where(
        quadratic_mask,
        0.5 * r**2,                       # Quadratic for |r| <= delta
        delta * abs_r - 0.5 * delta**2    # Linear for |r| > delta
    )
    return loss

def huber_gradient(r, delta=1.0):
    """
    Compute gradient of Huber loss (the psi function).

    Parameters:
    -----------
    r : array-like
        Residuals (y - y_pred)
    delta : float
        Threshold parameter

    Returns:
    --------
    grad : array-like
        Gradient values (influence function)
    """
    return np.where(
        np.abs(r) <= delta,
        r,                    # Linear gradient for small residuals
        delta * np.sign(r)    # Bounded gradient for large residuals
    )

# Visualization comparing L1, L2, and Huber losses
r = np.linspace(-5, 5, 1000)
delta = 1.35  # Common default value

l2_loss = 0.5 * r**2
l1_loss = np.abs(r)
huber = huber_loss(r, delta=delta)

plt.figure(figsize=(12, 5))

# Plot losses
plt.subplot(1, 2, 1)
plt.plot(r, l2_loss, 'b--', label='L2 (Squared)', linewidth=2)
plt.plot(r, l1_loss, 'g--', label='L1 (Absolute)', linewidth=2)
plt.plot(r, huber, 'r-', label=f'Huber (δ={delta})', linewidth=2)
plt.axvline(x=delta, color='gray', linestyle=':', alpha=0.7)
plt.axvline(x=-delta, color='gray', linestyle=':', alpha=0.7)
plt.xlabel('Residual (r)')
plt.ylabel('Loss')
plt.title('Loss Function Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-5, 5)
plt.ylim(0, 10)

# Plot gradients (influence functions)
plt.subplot(1, 2, 2)
plt.plot(r, r, 'b--', label='L2 gradient (unbounded)', linewidth=2)
plt.plot(r, np.sign(r), 'g--', label='L1 gradient (discontinuous)', linewidth=2)
plt.plot(r, huber_gradient(r, delta), 'r-', label='Huber gradient (bounded)', linewidth=2)
plt.axhline(y=delta, color='gray', linestyle=':', alpha=0.7)
plt.axhline(y=-delta, color='gray', linestyle=':', alpha=0.7)
plt.xlabel('Residual (r)')
plt.ylabel('Gradient / Influence')
plt.title('Influence Function Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Key insight: Huber gradient is bounded at ±{delta}")
print(f"L2 gradient at r=100: {100}")
print(f"Huber gradient at r=100: {huber_gradient(100, delta)}")
```

The threshold parameter $\delta$ is the critical tuning knob of Huber regression. It controls the tradeoff between efficiency (using more of the smooth L2 region) and robustness (treating more residuals as potential outliers).
Extreme cases: as $\delta \to 0$, Huber estimation behaves like L1 (LAD) regression, and as $\delta \to \infty$ it reduces to ordinary least squares; every finite positive $\delta$ interpolates between the two.
Practical guidelines:
The standard choice in robust statistics is $\delta = 1.345\sigma$, where $\sigma$ is the standard deviation of the errors. This value has a special property: 95% asymptotic efficiency under Gaussian errors.
When δ = 1.345σ, Huber estimation achieves 95% of the efficiency of OLS when errors are truly Gaussian, while gaining substantial protection against outliers. This means you only 'pay' 5% in efficiency for significant robustness gains.
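As a sanity check on the 95% figure, the simulation sketch below compares the sampling variance of the sample mean with that of a Huber location estimate on purely Gaussian data. The `huber_location` helper is written for this illustration only, and the error scale is assumed known and equal to 1; the printed ratio should land near 0.95 up to Monte Carlo noise.

```python
import numpy as np

def huber_location(x, delta=1.345, n_iter=30):
    """Huber M-estimate of location via simple reweighting (unit scale assumed)."""
    mu = np.median(x)                       # robust starting point
    for _ in range(n_iter):
        r = x - mu
        abs_r = np.abs(r)
        w = np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))
        mu = np.sum(w * x) / np.sum(w)      # weighted mean update
    return mu

rng = np.random.default_rng(3)
n, reps = 200, 2000
means = np.empty(reps)
hubers = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)                  # truly Gaussian errors, sigma = 1
    means[i] = x.mean()
    hubers[i] = huber_location(x)

# Relative efficiency of Huber vs the mean = var(mean) / var(huber); expect ~0.95
print(f"Empirical relative efficiency: {means.var() / hubers.var():.3f}")
```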
The chicken-and-egg problem:
To set $\delta = 1.345\sigma$, we need to know $\sigma$, but we need a fitted model to obtain residuals, and residuals to estimate $\sigma$. This circular dependency is resolved by iterating: start from an initial fit (for example, OLS), compute residuals, estimate the scale robustly from those residuals, update $\delta$, refit, and repeat until the coefficients stabilize. This is exactly what the IRLS implementation later on this page does.
Robust scale estimation:
The Median Absolute Deviation (MAD) provides a robust estimate of scale:
$$\text{MAD} = \text{median}(|r_i - \text{median}(r_i)|)$$
Under Gaussian errors: $\sigma \approx 1.4826 \cdot \text{MAD}$
The factor 1.4826 makes MAD a consistent estimator of $\sigma$ for Gaussian data.
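A short sketch with synthetic residuals shows why the MAD-based estimate is the right plug-in here, while the ordinary sample standard deviation is itself inflated by the outliers it is supposed to help detect:

```python
import numpy as np

rng = np.random.default_rng(4)
r = rng.normal(scale=2.0, size=1000)   # residuals with true sigma = 2
r[:20] += 100.0                        # 2% gross outliers

mad = np.median(np.abs(r - np.median(r)))
print(f"Sample std (non-robust): {r.std():6.2f}")        # badly inflated
print(f"1.4826 * MAD (robust):   {1.4826 * mad:6.2f}")   # close to the true sigma = 2
```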
| δ Value | Behavior | Efficiency (Gaussian) | Robustness |
|---|---|---|---|
| δ → 0 | Pure L1 (LAD) | ~64% | Maximum (50% breakdown) |
| δ = 0.5σ | Very aggressive outlier detection | ~70% | Very high |
| δ = 1.0σ | Moderate | ~88% | High |
| δ = 1.345σ | Standard choice | ~95% | Good |
| δ = 2.0σ | Conservative | ~99% | Moderate |
| δ → ∞ | Pure L2 (OLS) | 100% | Minimal (0% breakdown) |
Unlike OLS, Huber regression does not have a closed-form solution: because the loss switches from quadratic to linear at $\pm\delta$, the estimating equations are piecewise and cannot be solved by simple matrix algebra. However, the objective is convex, ensuring that any local minimum is the global minimum.
Iteratively Reweighted Least Squares (IRLS):
The most elegant approach is IRLS, which solves Huber regression through a sequence of weighted least squares problems.
Key insight: The Huber estimating equation can be written as:
$$\sum_{i=1}^{n} \psi_\delta(r_i) \mathbf{x}_i = \mathbf{0}$$
This is equivalent to a weighted least squares problem where the weights are:
$$w_i = \frac{\psi_\delta(r_i)}{r_i} = \begin{cases} 1 & \text{if } |r_i| \leq \delta \\ \frac{\delta}{|r_i|} & \text{if } |r_i| > \delta \end{cases}$$
Observations with small residuals receive full weight (1), while outliers receive down-weighted influence proportional to $1/|r_i|$.
```python
import numpy as np
from scipy import linalg

def huber_weights(residuals, delta):
    """Compute IRLS weights for Huber regression."""
    abs_r = np.abs(residuals)
    weights = np.where(
        abs_r <= delta,
        1.0,                               # Full weight for small residuals
        delta / np.maximum(abs_r, 1e-10)   # Downweight outliers
    )
    return weights

def huber_regression_irls(X, y, delta=1.345, max_iter=50, tol=1e-6):
    """
    Huber regression via Iteratively Reweighted Least Squares.

    Parameters:
    -----------
    X : array, shape (n_samples, n_features)
        Feature matrix (should include intercept column if desired)
    y : array, shape (n_samples,)
        Target values
    delta : float
        Huber threshold (rescaled by a MAD estimate of sigma at each iteration)
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance on coefficient change

    Returns:
    --------
    beta : array, shape (n_features,)
        Estimated coefficients
    history : dict
        Convergence information
    """
    n, p = X.shape

    # Initial estimate via OLS
    beta = linalg.lstsq(X, y)[0]
    history = {'iterations': 0, 'converged': False, 'weights': []}

    for iteration in range(max_iter):
        # Compute residuals
        residuals = y - X @ beta

        # Robust scale estimate (MAD)
        mad = np.median(np.abs(residuals - np.median(residuals)))
        sigma_hat = 1.4826 * mad if mad > 0 else 1.0

        # Adjusted threshold
        delta_scaled = delta * sigma_hat

        # Compute Huber weights
        weights = huber_weights(residuals, delta_scaled)

        # Weighted least squares: (X'WX)^{-1} X'Wy
        W = np.diag(weights)
        XtW = X.T @ W
        beta_new = linalg.solve(XtW @ X, XtW @ y)

        # Check convergence (record progress whether or not we stop here)
        change = np.max(np.abs(beta_new - beta))
        history['iterations'] = iteration + 1
        history['weights'].append(weights.copy())
        beta = beta_new

        if change < tol:
            history['converged'] = True
            break

    history['final_residuals'] = y - X @ beta
    history['final_weights'] = weights

    return beta, history

# Example: Robust regression with outliers
np.random.seed(42)
n = 100

# Generate clean data
X_clean = np.random.randn(n, 1)
X = np.column_stack([np.ones(n), X_clean])  # Add intercept
y_clean = 2 + 3 * X_clean.flatten() + 0.5 * np.random.randn(n)

# Add outliers (10% contamination)
n_outliers = 10
outlier_idx = np.random.choice(n, n_outliers, replace=False)
y_contaminated = y_clean.copy()
y_contaminated[outlier_idx] += np.random.choice([-1, 1], n_outliers) * 20

# Compare OLS vs Huber
beta_ols = linalg.lstsq(X, y_contaminated)[0]
beta_huber, history = huber_regression_irls(X, y_contaminated)

print("True coefficients: [2.0, 3.0]")
print(f"OLS coefficients: [{beta_ols[0]:.3f}, {beta_ols[1]:.3f}]")
print(f"Huber coefficients: [{beta_huber[0]:.3f}, {beta_huber[1]:.3f}]")
print(f"IRLS converged in {history['iterations']} iterations")
print(f"Mean outlier weight: {history['final_weights'][outlier_idx].mean():.3f}")
```

Huber estimators possess well-characterized statistical properties that justify their use in practice. Understanding these properties helps practitioners know when and how to apply Huber regression appropriately.
Consistency:
Under mild regularity conditions, the Huber estimator $\hat{\boldsymbol{\beta}}_H$ is consistent:
$$\hat{\boldsymbol{\beta}}_H \xrightarrow{p} \boldsymbol{\beta}_0 \text{ as } n \to \infty$$
This holds even when the error distribution has heavier tails than Gaussian, provided the tails are symmetric and the contamination fraction is below the breakdown point.
Asymptotic normality:
The Huber estimator is asymptotically normal:
$$\sqrt{n}(\hat{\boldsymbol{\beta}}_H - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}(\mathbf{0}, V_H)$$
The asymptotic covariance matrix has the usual M-estimator sandwich form:
$$V_H = \frac{\mathbb{E}[\psi_\delta(\epsilon)^2]}{(\mathbb{E}[\psi'_\delta(\epsilon)])^2} (\mathbf{X}^\top \mathbf{X})^{-1}$$
Efficiency:
The asymptotic relative efficiency (ARE) of Huber vs OLS under Gaussian errors is:
$$\text{ARE}_{\text{Huber/OLS}} = \frac{(\mathbb{E}[\psi'_\delta(\epsilon)])^2}{\mathbb{E}[\psi_\delta(\epsilon)^2]}$$
For $\delta = 1.345\sigma$, this yields ARE ≈ 0.95 (95% efficiency).
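The expectations in the ARE formula have closed forms under standard Gaussian errors ($\sigma = 1$): $\mathbb{E}[\psi'_\delta(\epsilon)] = 2\Phi(\delta) - 1$ and $\mathbb{E}[\psi_\delta(\epsilon)^2] = 2\Phi(\delta) - 1 - 2\delta\varphi(\delta) + 2\delta^2(1 - \Phi(\delta))$. The short sketch below (assuming standardized errors) evaluates them numerically to confirm the 95% figure:

```python
import numpy as np
from scipy.stats import norm

def gaussian_are(delta):
    """Asymptotic relative efficiency of the Huber estimator vs OLS, Gaussian errors."""
    e_psi_prime = 2 * norm.cdf(delta) - 1                       # E[psi'] = P(|eps| <= delta)
    e_psi_sq = (2 * norm.cdf(delta) - 1 - 2 * delta * norm.pdf(delta)
                + 2 * delta**2 * (1 - norm.cdf(delta)))         # E[psi^2]
    return e_psi_prime**2 / e_psi_sq

for d in [0.01, 1.345, 2.0]:
    print(f"delta = {d:5.2f}  ->  ARE vs OLS = {gaussian_are(d):.3f}")
# Very small delta approaches the LAD efficiency of about 2/pi ~ 0.64, while
# delta = 1.345 gives roughly 0.95, matching the standard tuning recommendation.
```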
Huber regression achieves an excellent tradeoff: 95% efficiency when the data is clean, with substantial protection when it is contaminated. This small efficiency cost is cheap insurance against outlier-induced catastrophic failures.
We have established a deep understanding of Huber loss as the foundational loss function for robust regression. The key insights: squared loss lets a single extreme residual dominate the fit; Huber loss keeps the smooth quadratic behavior near zero while capping the influence of large residuals at $\pm\delta$; the threshold $\delta$ (typically $1.345\sigma$, with $\sigma$ estimated robustly via MAD) trades efficiency for robustness; and IRLS reduces the whole problem to a short sequence of weighted least squares fits.
What's next:
Huber loss is just the beginning of robust regression. In the next page, we'll explore M-estimators—the general framework of which Huber is a special case. M-estimators allow us to design custom loss functions for different robustness-efficiency tradeoffs, including completely bounded-influence estimators like Tukey's bisquare.
You now understand Huber loss deeply—its motivation, mathematical form, statistical properties, and implementation. This foundation prepares you for the broader M-estimator framework and more advanced robust regression techniques.