In the previous page, we saw how Huber loss provides a principled compromise between the efficiency of L2 loss and the robustness of L1 loss. But Huber loss is just one point in a vast design space. What if we want even more robustness at the cost of some efficiency? What if we want to completely eliminate the influence of extreme outliers rather than merely down-weighting them?
M-estimators (Maximum likelihood-type estimators) provide the general framework that encompasses OLS, LAD, Huber, and many other estimation approaches. The "M" stands for "maximum likelihood-type" because these estimators generalize the maximum likelihood principle—instead of maximizing a likelihood, we minimize a general loss function.
The M-estimator framework gives us precise language and tools to:
- compare OLS, LAD, and Huber regression on a common footing,
- quantify exactly how much influence any single observation can exert,
- design estimators with specified robustness-efficiency tradeoffs.
This page develops the complete theory of M-estimators, from the defining equations to the most popular choices used in practice, including Tukey's bisquare, Hampel's three-part redescender, and Andrews' wave.
By the end of this page, you will understand the general M-estimator framework, the role of ρ, ψ, and weight functions, the classification of estimators (monotone vs. redescending), and how to select appropriate M-estimators for different robustness requirements.
Definition:
An M-estimator of regression $\hat{\boldsymbol{\beta}}$ is defined as the value minimizing:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\left(\frac{r_i}{\hat{\sigma}}\right) = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\left(\frac{y_i - \mathbf{x}_i^\top \boldsymbol{\beta}}{\hat{\sigma}}\right)$$
where:
- $r_i = y_i - \mathbf{x}_i^\top \boldsymbol{\beta}$ is the $i$-th residual,
- $\hat{\sigma}$ is a robust estimate of the residual scale (typically the normalized MAD),
- $\rho$ is a loss function satisfying the conditions discussed below.
The standardization by $\hat{\sigma}$ is crucial—it makes the estimator scale equivariant, meaning the estimates adjust appropriately when data is rescaled.
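The scale estimate $\hat{\sigma}$ is typically the normalized MAD of the residuals. A minimal sketch (the helper name `mad_scale` is ours) showing that the standardized residuals $r_i/\hat{\sigma}$ are invariant to rescaling the data:

```python
import numpy as np

def mad_scale(r):
    """Normalized MAD: consistent for the std. deviation under Gaussian errors."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(0)
r = rng.normal(0, 2, 500)                  # residuals on the original scale

u = r / mad_scale(r)                       # standardized residuals
u_rescaled = (10 * r) / mad_scale(10 * r)  # same data, units multiplied by 10

print(np.allclose(u, u_rescaled))  # True: the MAD scales along with the data
```

Because $\hat{\sigma}$ scales with the data, the loss sees the same standardized residuals either way, which is exactly the scale equivariance property.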
The three fundamental functions:
Every M-estimator is characterized by three related functions:
ρ-function (rho): The objective function to minimize $$\rho: \mathbb{R} \to \mathbb{R}^+$$
ψ-function (psi): The derivative of ρ, also called the influence function or score function $$\psi(u) = \frac{d\rho(u)}{du}$$
Weight function: For computational purposes $$w(u) = \frac{\psi(u)}{u}$$
The ρ, ψ, and w functions form a complete characterization of any M-estimator. ρ tells us what we're optimizing, ψ tells us how observations influence the solution, and w enables iteratively reweighted least squares computation.
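These relationships are easy to verify numerically. A quick sketch for Huber (the grid and tolerance are incidental choices):

```python
import numpy as np

k = 1.345  # Huber tuning constant

def rho_huber(u):
    # Quadratic in the core, linear in the tails
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

def psi_huber(u):
    return np.where(np.abs(u) <= k, u, k * np.sign(u))

u = np.linspace(-5, 5, 1001)
eps = 1e-6
psi_numeric = (rho_huber(u + eps) - rho_huber(u - eps)) / (2 * eps)

print(np.allclose(psi_numeric, psi_huber(u), atol=1e-5))  # True: psi = rho'
print(np.isclose(psi_huber(3.0) / 3.0, k / 3.0))          # True: w(u) = psi(u)/u
```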
| Estimator | ρ(u) | ψ(u) | w(u) |
|---|---|---|---|
| OLS (L2) | $\frac{1}{2}u^2$ | $u$ | $1$ |
| LAD (L1) | $|u|$ | $\text{sign}(u)$ | $1/|u|$ |
| Huber | Piecewise (L2/L1) | Bounded linear | Piecewise constant |
| Tukey Bisquare | Bounded quartic | Redescending | Smooth decay to 0 |
Not every function can serve as a valid ρ-function for an M-estimator. To ensure well-behaved estimators with desirable statistical properties, ρ-functions should satisfy certain conditions:
Essential properties:
- $\rho(u) \geq 0$ for all $u$, with $\rho(0) = 0$ (zero residuals incur no loss)
- Symmetry: $\rho(u) = \rho(-u)$
- $\rho$ is non-decreasing in $|u|$ (larger residuals never incur less loss)

Desirable properties:
- Differentiability, so that $\psi = \rho'$ exists and the estimating equations are well defined
- Bounded $\psi$, so that no single observation can exert unbounded influence
- Convexity, which guarantees a unique global minimum (though this conflicts with redescending behavior)
The convexity tradeoff:
Convex ρ-functions (like Huber) guarantee a unique global minimum, making optimization straightforward. However, a convex ρ can never be redescending: its ψ is monotone, so extreme outliers retain bounded but nonzero influence rather than being rejected.
Non-convex ρ-functions (like Tukey's bisquare) can completely eliminate outlier influence but may have multiple local minima, complicating optimization.
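The tradeoff is easy to see in a one-dimensional location problem. In this illustrative sketch (synthetic data, scale fixed at the sample MAD), the non-convex Tukey objective develops a second local minimum near an outlier cluster, while the convex Huber objective keeps a single one:

```python
import numpy as np

c, k = 4.685, 1.345

def rho_tukey(u):
    inside = np.abs(u) <= c
    return np.where(inside, (c**2 / 6) * (1 - (1 - (u / c)**2)**3), c**2 / 6)

def rho_huber(u):
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

# Location estimation: 80% of the data near 0, 20% gross outliers near 20.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 80), rng.normal(20, 1, 20)])
sigma = 1.4826 * np.median(np.abs(data - np.median(data)))

grid = np.linspace(-5, 25, 1201)

def n_local_minima(rho):
    obj = np.array([rho((data - m) / sigma).sum() for m in grid])
    interior = (obj[1:-1] < obj[:-2]) & (obj[1:-1] < obj[2:])
    return int(interior.sum())

print("Tukey local minima:", n_local_minima(rho_tukey))  # typically 2
print("Huber local minima:", n_local_minima(rho_huber))  # 1 (convex)
```

With a bad starting value, an optimizer can get stuck in the spurious minimum near the outliers, which is why redescending estimators are usually initialized from a robust preliminary fit.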
The ψ-function (psi-function) is the heart of M-estimator theory. It directly measures how each observation influences the final estimate.
From optimization to influence:
Setting the derivative of the objective to zero gives the M-estimation equations:
$$\sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}}{\hat{\sigma}}\right) \mathbf{x}_i = \mathbf{0}$$
This system of equations defines the M-estimator. The value $\psi(u_i)$ where $u_i = r_i/\hat{\sigma}$ determines observation $i$'s contribution to determining $\hat{\boldsymbol{\beta}}$.
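As a sanity check, OLS is the special case $\psi(u) = u$, for which the equations reduce to the familiar normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

# With psi(u) = u, sum_i psi(u_i) x_i is X^T r / sigma, so the M-estimation
# equations hold exactly at the OLS solution (sigma merely rescales them):
print(np.allclose(X.T @ r, 0))  # True
```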
Classifying ψ-functions:
M-estimators are classified by their ψ-function behavior:
1. Monotone ψ-functions: $$\psi(u) \text{ is non-decreasing for } u > 0$$
Examples: OLS ($\psi(u) = u$), Huber (bounded but non-decreasing)
Properties:
- The objective is convex, so any local minimum is global and computation is reliable
- Influence of extreme outliers is bounded (when ψ is bounded) but never reaches zero
- Unique solution under mild conditions
2. Redescending ψ-functions: $$\psi(u) \to 0 \text{ as } |u| \to \infty$$
Examples: Tukey bisquare, Hampel, Andrews' wave
Properties:
- Extreme outliers are eventually rejected: their influence decreases and reaches zero
- The objective is non-convex, so multiple local minima are possible and good starting values matter
- Maximum robustness among M-estimators, at the cost of trickier optimization
A redescending ψ-function means that beyond some threshold, observations have DECREASING influence as they become more extreme. At the limit, extreme outliers have zero influence—they are completely rejected from the estimation.
```python
import numpy as np
import matplotlib.pyplot as plt

def psi_ols(u):
    """OLS psi-function: unbounded linear"""
    return u

def psi_huber(u, k=1.345):
    """Huber psi-function: bounded linear"""
    return np.where(np.abs(u) <= k, u, k * np.sign(u))

def psi_tukey(u, c=4.685):
    """
    Tukey's bisquare (biweight) psi-function: redescending

    The constant c=4.685 gives 95% efficiency under Gaussian errors.
    """
    return np.where(np.abs(u) <= c, u * (1 - (u/c)**2)**2, 0)

def psi_hampel(u, a=1.7, b=3.4, c=8.5):
    """
    Hampel's three-part redescending psi-function

    Parameters define three regions:
    - Linear growth up to a
    - Constant plateau from a to b
    - Linear descent from b to c (rejecting outliers)
    """
    abs_u = np.abs(u)
    sign_u = np.sign(u)
    result = np.where(abs_u <= a, u,
             np.where(abs_u <= b, a * sign_u,
             np.where(abs_u <= c, a * sign_u * (c - abs_u) / (c - b), 0)))
    return result

# Visualization
u = np.linspace(-10, 10, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot psi functions
ax1 = axes[0]
ax1.plot(u, psi_ols(u), 'b--', label='OLS (unbounded)', linewidth=2)
ax1.plot(u, psi_huber(u), 'g-', label='Huber (bounded)', linewidth=2)
ax1.plot(u, psi_tukey(u), 'r-', label='Tukey (redescending)', linewidth=2)
ax1.plot(u, psi_hampel(u), 'm-', label='Hampel (redescending)', linewidth=2)
ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax1.set_xlabel('Standardized Residual (u)')
ax1.set_ylabel('ψ(u)')
ax1.set_title('ψ-Functions: How Observations Influence Estimates')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-10, 10)
ax1.set_ylim(-6, 6)

# Plot weight functions w(u) = psi(u)/u
ax2 = axes[1]
# Avoid division by zero
u_nonzero = np.where(np.abs(u) > 0.01, u, np.nan)
ax2.plot(u, np.ones_like(u), 'b--', label='OLS (constant)', linewidth=2)
ax2.plot(u, psi_huber(u)/u_nonzero, 'g-', label='Huber', linewidth=2)
ax2.plot(u, psi_tukey(u)/u_nonzero, 'r-', label='Tukey', linewidth=2)
ax2.plot(u, psi_hampel(u)/u_nonzero, 'm-', label='Hampel', linewidth=2)
ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax2.set_xlabel('Standardized Residual (u)')
ax2.set_ylabel('w(u) = ψ(u)/u')
ax2.set_title('Weight Functions: Observation Weights in IRLS')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-10, 10)
ax2.set_ylim(-0.5, 1.5)

plt.tight_layout()
plt.show()

print("Key insight: Redescending estimators (Tukey, Hampel)")
print("assign ZERO weight to extreme outliers, completely rejecting them.")
```

Let's examine the most widely used M-estimators in detail, understanding their mathematical form and practical use cases.
1. Tukey's Bisquare (Biweight)
Proposed by John Tukey, the bisquare is the most popular redescending estimator:
$$\rho(u) = \begin{cases} \frac{c^2}{6}\left[1 - \left(1 - \left(\frac{u}{c}\right)^2\right)^3\right] & |u| \leq c \\ \frac{c^2}{6} & |u| > c \end{cases}$$

$$\psi(u) = \begin{cases} u\left(1 - \left(\frac{u}{c}\right)^2\right)^2 & |u| \leq c \\ 0 & |u| > c \end{cases}$$
With $c = 4.685$, Tukey's bisquare achieves 95% efficiency under Gaussian errors.
Properties:
- Smooth (continuously differentiable) ψ with no hard cutoffs
- Zero influence for $|u| > c$: gross outliers are rejected entirely
- 95% Gaussian efficiency at $c = 4.685$; smaller $c$ trades efficiency for more robustness
The name 'bisquare' comes from the (1 - u²)² factor—squaring the (1 - u²) term. This creates smooth descent to zero influence, unlike hard cutoffs that could cause numerical instabilities.
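Differentiating the ρ above does recover the stated ψ; a quick numerical sketch (grid and tolerance are arbitrary choices):

```python
import numpy as np

c = 4.685

def rho_tukey(u):
    inside = np.abs(u) <= c
    return np.where(inside, (c**2 / 6) * (1 - (1 - (u / c)**2)**3), c**2 / 6)

def psi_tukey(u):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.linspace(-8, 8, 801)
eps = 1e-6
psi_numeric = (rho_tukey(u + eps) - rho_tukey(u - eps)) / (2 * eps)

print(np.allclose(psi_numeric, psi_tukey(u), atol=1e-5))  # True: psi = rho'
```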
2. Hampel's Three-Part Redescender
Frank Hampel proposed a more interpretable piecewise-linear ψ-function with three distinct regions:
$$\psi(u) = \begin{cases} u & 0 \leq |u| \leq a \\ a \cdot \text{sign}(u) & a < |u| \leq b \\ a \cdot \text{sign}(u) \cdot \frac{c - |u|}{c - b} & b < |u| \leq c \\ 0 & |u| > c \end{cases}$$
Standard constants: $a = 1.7, b = 3.4, c = 8.5$
Intuition:
- $|u| \leq a$: full influence, as in OLS
- $a < |u| \leq b$: influence capped at a constant, as in Huber
- $b < |u| \leq c$: influence descends linearly toward zero
- $|u| > c$: the observation is rejected outright
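A quick check, using the standard constants, that the three pieces of ψ join continuously at $a$, $b$, and $c$ (a hard jump in ψ would reintroduce the instabilities the smooth descent is meant to avoid):

```python
import numpy as np

a, b, c = 1.7, 3.4, 8.5

def psi_hampel(u):
    abs_u, sign_u = np.abs(u), np.sign(u)
    return np.where(abs_u <= a, u,
           np.where(abs_u <= b, a * sign_u,
           np.where(abs_u <= c, a * sign_u * (c - abs_u) / (c - b), 0.0)))

# The adjacent pieces agree at each breakpoint, so psi is continuous:
eps = 1e-9
for t in (a, b, c):
    assert abs(float(psi_hampel(t - eps)) - float(psi_hampel(t + eps))) < 1e-6
print("psi is continuous at a, b, c; psi(c) =", float(psi_hampel(c)))
```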
3. Andrews' Wave (Sine Wave)
$$\psi(u) = \begin{cases} \sin(u/c) & |u| \leq c\pi \\ 0 & |u| > c\pi \end{cases}$$
Standard constant: $c = 1.339$
Andrews' wave uses a sinusoidal form for an especially smooth transition: influence rises to a peak at $|u| = c\pi/2$, then descends smoothly to zero at $|u| = c\pi$, beyond which observations are fully rejected. Because ψ is cut off at $|u| = c\pi$, the sine never actually oscillates, and in practice the estimator behaves much like Tukey's bisquare.
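A minimal sketch (the helper name `psi_andrews` is ours): influence peaks at $|u| = c\pi/2$ and is exactly zero beyond $|u| = c\pi$:

```python
import numpy as np

c = 1.339  # 95% efficiency under Gaussian errors

def psi_andrews(u):
    return np.where(np.abs(u) <= c * np.pi,
                    np.sin(np.asarray(u, dtype=float) / c), 0.0)

u = np.linspace(-10, 10, 2001)
w = psi_andrews(u)

peak = u[np.argmax(w)]
print(peak)                                  # close to c*pi/2 (about 2.10)
print(float(psi_andrews(c * np.pi + 0.01)))  # 0.0: rejected beyond c*pi
```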
| Estimator | Type | Efficiency (Gaussian) | Breakdown | Use Case |
|---|---|---|---|---|
| OLS | Monotone (unbounded) | 100% | 0% | Clean data, Gaussian errors |
| Huber (k=1.345) | Monotone (bounded) | 95% | ~0% | Mild outliers expected |
| Tukey (c=4.685) | Redescending | 95% | ~50%* | Significant outliers expected |
| Hampel | Redescending | 95% | ~50%* | Need interpretable cutoffs |
| Andrews' Wave | Redescending | 95% | ~50%* | Very smooth transitions needed |
The ~50% breakdown point for redescending estimators is only achieved with appropriate scale estimation and design matrix conditions. S-estimators or MM-estimators are needed to guarantee high breakdown points in regression settings.
Iteratively Reweighted Least Squares (IRLS) provides a unified algorithm for computing any M-estimator, regardless of the specific ρ-function chosen.
The IRLS algorithm:
The M-estimation equations can be written as:
$$\sum_{i=1}^{n} w_i(\boldsymbol{\beta}) \cdot r_i \cdot \mathbf{x}_i = \mathbf{0}$$
where $w_i = \psi(u_i)/u_i$ and $u_i = r_i/\hat{\sigma}$.
This is equivalent to weighted least squares with weights $w_i$—but the weights depend on $\boldsymbol{\beta}$! IRLS resolves this by iterating:
1. Compute residuals $r_i = y_i - \mathbf{x}_i^\top \boldsymbol{\beta}$ from the current estimate
2. Update the robust scale $\hat{\sigma}$ (e.g., the normalized MAD of the residuals)
3. Compute weights $w_i = w(r_i/\hat{\sigma})$
4. Solve the weighted least squares problem to obtain a new $\boldsymbol{\beta}$
5. Repeat until the coefficients converge
```python
import numpy as np
from scipy import linalg

def weight_tukey(u, c=4.685):
    """Tukey bisquare weight function"""
    return np.where(np.abs(u) <= c, (1 - (u/c)**2)**2, 0)

def weight_huber(u, k=1.345):
    """Huber weight function"""
    return np.where(np.abs(u) <= k, 1.0, k / np.abs(u))

def weight_hampel(u, a=1.7, b=3.4, c=8.5):
    """Hampel weight function"""
    abs_u = np.abs(u)
    return np.where(abs_u <= a, 1.0,
           np.where(abs_u <= b, a / abs_u,
           np.where(abs_u <= c, a * (c - abs_u) / (abs_u * (c - b)), 0)))

def m_estimator_irls(X, y, weight_func, max_iter=100, tol=1e-6):
    """
    General M-estimator via IRLS.

    Parameters
    ----------
    X : array (n, p)
        Design matrix with intercept
    y : array (n,)
        Response
    weight_func : callable
        Weight function w(u) = psi(u)/u
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance

    Returns
    -------
    beta : array (p,)
        Estimated coefficients
    info : dict
        Convergence information
    """
    n, p = X.shape

    # For simplicity, use OLS as the starting point; a robust initial
    # estimate (e.g., via L1/LAD) is preferable for redescending estimators.
    beta = linalg.lstsq(X, y)[0]

    info = {'iterations': 0, 'converged': False}

    for iteration in range(max_iter):
        # Residuals
        r = y - X @ beta

        # Robust scale (MAD)
        mad = np.median(np.abs(r - np.median(r)))
        sigma = 1.4826 * mad if mad > 1e-10 else 1.0

        # Standardized residuals
        u = r / sigma

        # Weights
        w = weight_func(u)

        # Handle zero weights (for numerical stability)
        w = np.maximum(w, 1e-10)

        # Weighted least squares
        W = np.diag(w)
        XtWX = X.T @ W @ X
        XtWy = X.T @ W @ y

        # Add small regularization for stability
        beta_new = linalg.solve(XtWX + 1e-10 * np.eye(p), XtWy)

        # Check convergence
        change = np.max(np.abs(beta_new - beta))
        if change < tol:
            info['converged'] = True
            info['iterations'] = iteration + 1
            beta = beta_new
            break

        beta = beta_new

    info['final_weights'] = w
    info['final_scale'] = sigma
    return beta, info

# Example comparison
np.random.seed(42)
n = 100

# Generate data with outliers
X = np.column_stack([np.ones(n), np.random.randn(n)])
true_beta = np.array([2.0, 3.0])
y = X @ true_beta + 0.5 * np.random.randn(n)

# Add 10% outliers
outlier_idx = np.random.choice(n, 10, replace=False)
y[outlier_idx] += np.random.choice([-1, 1], 10) * 15

# Fit various M-estimators
beta_ols = linalg.lstsq(X, y)[0]
beta_huber, _ = m_estimator_irls(X, y, weight_huber)
beta_tukey, info = m_estimator_irls(X, y, weight_tukey)

print("True coefficients:", true_beta)
print(f"OLS:   [{beta_ols[0]:.3f}, {beta_ols[1]:.3f}]")
print(f"Huber: [{beta_huber[0]:.3f}, {beta_huber[1]:.3f}]")
print(f"Tukey: [{beta_tukey[0]:.3f}, {beta_tukey[1]:.3f}]")
print(f"Tukey outlier weights: {info['final_weights'][outlier_idx]}")
```

With many M-estimators available, how should practitioners choose? The decision depends on your data characteristics, robustness requirements, and computational constraints.
Decision framework:
- Clean data, efficiency paramount: OLS
- Mild outliers expected, reliable convergence needed: Huber
- Gross outliers that should be rejected outright: Tukey's bisquare
- Explicit, interpretable cutoffs required: Hampel
For most applications, Huber regression (monotone) or Tukey's bisquare (redescending) covers the vast majority of use cases. Start with Huber for its simplicity and guaranteed convergence; upgrade to Tukey when you need complete outlier rejection.
What's next:
M-estimators provide elegant theory but face challenges when outliers are extreme or when outliers occur in the X-space (leverage points). The next page introduces RANSAC (Random Sample Consensus)—a fundamentally different approach that explicitly models data as inliers plus outliers, achieving remarkable robustness even under extreme contamination.
You now understand M-estimators as the general framework for robust estimation, including the roles of ρ, ψ, and weight functions, the distinction between monotone and redescending estimators, and how to select appropriate estimators for different scenarios.