In the previous page, we saw how Huber loss provides a principled compromise between the efficiency of L2 loss and the robustness of L1 loss. But Huber loss is just one point in a vast design space. What if we want even more robustness at the cost of some efficiency? What if we want to completely eliminate the influence of extreme outliers rather than merely down-weighting them?
M-estimators (Maximum likelihood-type estimators) provide the general framework that encompasses OLS, LAD, Huber, and many other estimation approaches. The "M" stands for "maximum likelihood-type" because these estimators generalize the maximum likelihood principle—instead of maximizing a likelihood, we minimize a general loss function.
The M-estimator framework gives us precise language and tools to:
- compare OLS, LAD, and Huber regression on a common footing,
- quantify exactly how much influence any single observation can exert,
- design estimators with specified robustness-efficiency tradeoffs.
This page develops the complete theory of M-estimators, from the defining equations to the most popular choices used in practice, including Tukey's bisquare, Hampel's three-part redescender, and Andrews' wave.
By the end of this page, you will understand the general M-estimator framework, the role of ρ, ψ, and weight functions, the classification of estimators (monotone vs. redescending), and how to select appropriate M-estimators for different robustness requirements.
Definition:
An M-estimator of regression $\hat{\boldsymbol{\beta}}$ is defined as the value minimizing:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\left(\frac{r_i}{\hat{\sigma}}\right) = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\left(\frac{y_i - \mathbf{x}_i^\top \boldsymbol{\beta}}{\hat{\sigma}}\right)$$
where:
- $r_i = y_i - \mathbf{x}_i^\top \boldsymbol{\beta}$ is the $i$-th residual,
- $\hat{\sigma}$ is a robust estimate of the residual scale (typically the normalized MAD),
- $\rho$ is a loss function satisfying the conditions discussed below.
The standardization by $\hat{\sigma}$ is crucial—it makes the estimator scale equivariant, meaning the estimates adjust appropriately when data is rescaled.
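The scale estimate $\hat{\sigma}$ is typically the normalized MAD of the residuals. A minimal sketch (the helper name `mad_scale` is ours) showing that the standardized residuals $r_i/\hat{\sigma}$ are invariant to rescaling the data:

```python
import numpy as np

def mad_scale(r):
    """Normalized MAD: consistent for the std. deviation under Gaussian errors."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(0)
r = rng.normal(0, 2, 500)                  # residuals on the original scale

u = r / mad_scale(r)                       # standardized residuals
u_rescaled = (10 * r) / mad_scale(10 * r)  # same data, units multiplied by 10

print(np.allclose(u, u_rescaled))  # True: the MAD scales along with the data
```

Because $\hat{\sigma}$ scales with the data, the loss sees the same standardized residuals either way, which is exactly the scale equivariance property.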
The three fundamental functions:
Every M-estimator is characterized by three related functions:
ρ-function (rho): The objective function to minimize $$\rho: \mathbb{R} \to \mathbb{R}^+$$
ψ-function (psi): The derivative of ρ, also called the influence function or score function $$\psi(u) = \frac{d\rho(u)}{du}$$
Weight function: For computational purposes $$w(u) = \frac{\psi(u)}{u}$$
The ρ, ψ, and w functions form a complete characterization of any M-estimator. ρ tells us what we're optimizing, ψ tells us how observations influence the solution, and w enables iteratively reweighted least squares computation.
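These relationships are easy to verify numerically. A quick sketch for Huber (the grid and tolerance are incidental choices):

```python
import numpy as np

k = 1.345  # Huber tuning constant

def rho_huber(u):
    # Quadratic in the core, linear in the tails
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

def psi_huber(u):
    return np.where(np.abs(u) <= k, u, k * np.sign(u))

u = np.linspace(-5, 5, 1001)
eps = 1e-6
psi_numeric = (rho_huber(u + eps) - rho_huber(u - eps)) / (2 * eps)

print(np.allclose(psi_numeric, psi_huber(u), atol=1e-5))  # True: psi = rho'
print(np.isclose(psi_huber(3.0) / 3.0, k / 3.0))          # True: w(u) = psi(u)/u
```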
| Estimator | ρ(u) | ψ(u) | w(u) |
|---|---|---|---|
| OLS (L2) | $\frac{1}{2}u^2$ | $u$ | $1$ |
| LAD (L1) | $|u|$ | $\text{sign}(u)$ | $1/|u|$ |
| Huber | Piecewise (L2/L1) | Bounded linear | Piecewise constant |
| Tukey Bisquare | Bounded quartic | Redescending | Smooth decay to 0 |
Not every function can serve as a valid ρ-function for an M-estimator. To ensure well-behaved estimators with desirable statistical properties, ρ-functions should satisfy certain conditions:
Essential properties:
- $\rho(u) \geq 0$ for all $u$, with $\rho(0) = 0$ (zero residuals incur no loss)
- Symmetry: $\rho(u) = \rho(-u)$
- $\rho$ is non-decreasing in $|u|$ (larger residuals never incur less loss)

Desirable properties:
- Differentiability, so that $\psi = \rho'$ exists and the estimating equations are well defined
- Bounded $\psi$, so that no single observation can exert unbounded influence
- Convexity, which guarantees a unique global minimum (though this conflicts with redescending behavior)
The convexity tradeoff:
Convex ρ-functions (like Huber) guarantee a unique global minimum, making optimization straightforward. However, a convex ρ can never be redescending: its ψ is monotone, so extreme outliers retain bounded but nonzero influence rather than being rejected.
Non-convex ρ-functions (like Tukey's bisquare) can completely eliminate outlier influence but may have multiple local minima, complicating optimization.
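The tradeoff is easy to see in a one-dimensional location problem. In this illustrative sketch (synthetic data, scale fixed at the sample MAD), the non-convex Tukey objective develops a second local minimum near an outlier cluster, while the convex Huber objective keeps a single one:

```python
import numpy as np

c, k = 4.685, 1.345

def rho_tukey(u):
    inside = np.abs(u) <= c
    return np.where(inside, (c**2 / 6) * (1 - (1 - (u / c)**2)**3), c**2 / 6)

def rho_huber(u):
    return np.where(np.abs(u) <= k, 0.5 * u**2, k * np.abs(u) - 0.5 * k**2)

# Location estimation: 80% of the data near 0, 20% gross outliers near 20.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 80), rng.normal(20, 1, 20)])
sigma = 1.4826 * np.median(np.abs(data - np.median(data)))

grid = np.linspace(-5, 25, 1201)

def n_local_minima(rho):
    obj = np.array([rho((data - m) / sigma).sum() for m in grid])
    interior = (obj[1:-1] < obj[:-2]) & (obj[1:-1] < obj[2:])
    return int(interior.sum())

print("Tukey local minima:", n_local_minima(rho_tukey))  # typically 2
print("Huber local minima:", n_local_minima(rho_huber))  # 1 (convex)
```

With a bad starting value, an optimizer can get stuck in the spurious minimum near the outliers, which is why redescending estimators are usually initialized from a robust preliminary fit.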
The ψ-function (psi-function) is the heart of M-estimator theory. It directly measures how each observation influences the final estimate.
From optimization to influence:
Setting the derivative of the objective to zero gives the M-estimation equations:
$$\sum_{i=1}^{n} \psi\left(\frac{y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}}{\hat{\sigma}}\right) \mathbf{x}_i = \mathbf{0}$$
This system of equations defines the M-estimator. The value $\psi(u_i)$ where $u_i = r_i/\hat{\sigma}$ determines observation $i$'s contribution to determining $\hat{\boldsymbol{\beta}}$.
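As a sanity check, OLS is the special case $\psi(u) = u$, for which the equations reduce to the familiar normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols

# With psi(u) = u, sum_i psi(u_i) x_i is X^T r / sigma, so the M-estimation
# equations hold exactly at the OLS solution (sigma merely rescales them):
print(np.allclose(X.T @ r, 0))  # True
```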
Classifying ψ-functions:
M-estimators are classified by their ψ-function behavior:
1. Monotone ψ-functions: $$\psi(u) \text{ is non-decreasing for } u > 0$$
Examples: OLS ($\psi(u) = u$), Huber (bounded but non-decreasing)
Properties:
- The objective is convex, so any local minimum is global and computation is reliable
- Influence of extreme outliers is bounded (when ψ is bounded) but never reaches zero
- Unique solution under mild conditions
2. Redescending ψ-functions: $$\psi(u) \to 0 \text{ as } |u| \to \infty$$
Examples: Tukey bisquare, Hampel, Andrews' wave
Properties:
- Extreme outliers are eventually rejected: their influence decreases and reaches zero
- The objective is non-convex, so multiple local minima are possible and good starting values matter
- Maximum robustness among M-estimators, at the cost of trickier optimization
A redescending ψ-function means that beyond some threshold, observations have DECREASING influence as they become more extreme. At the limit, extreme outliers have zero influence—they are completely rejected from the estimation.
```python
import numpy as np
import matplotlib.pyplot as plt

def psi_ols(u):
    """OLS psi-function: unbounded linear"""
    return u

def psi_huber(u, k=1.345):
    """Huber psi-function: bounded linear"""
    return np.where(np.abs(u) <= k, u, k * np.sign(u))

def psi_tukey(u, c=4.685):
    """
    Tukey's bisquare (biweight) psi-function: redescending

    The constant c=4.685 gives 95% efficiency under Gaussian errors.
    """
    return np.where(np.abs(u) <= c, u * (1 - (u/c)**2)**2, 0)

def psi_hampel(u, a=1.7, b=3.4, c=8.5):
    """
    Hampel's three-part redescending psi-function

    Parameters define three regions:
    - Linear growth up to a
    - Constant plateau from a to b
    - Linear descent from b to c (rejecting outliers)
    """
    abs_u = np.abs(u)
    sign_u = np.sign(u)
    result = np.where(abs_u <= a, u,
             np.where(abs_u <= b, a * sign_u,
             np.where(abs_u <= c, a * sign_u * (c - abs_u) / (c - b), 0)))
    return result

# Visualization
u = np.linspace(-10, 10, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot psi functions
ax1 = axes[0]
ax1.plot(u, psi_ols(u), 'b--', label='OLS (unbounded)', linewidth=2)
ax1.plot(u, psi_huber(u), 'g-', label='Huber (bounded)', linewidth=2)
ax1.plot(u, psi_tukey(u), 'r-', label='Tukey (redescending)', linewidth=2)
ax1.plot(u, psi_hampel(u), 'm-', label='Hampel (redescending)', linewidth=2)
ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax1.set_xlabel('Standardized Residual (u)')
ax1.set_ylabel('ψ(u)')
ax1.set_title('ψ-Functions: How Observations Influence Estimates')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-10, 10)
ax1.set_ylim(-6, 6)

# Plot weight functions w(u) = psi(u)/u
ax2 = axes[1]
# Avoid division by zero
u_nonzero = np.where(np.abs(u) > 0.01, u, np.nan)
ax2.plot(u, np.ones_like(u), 'b--', label='OLS (constant)', linewidth=2)
ax2.plot(u, psi_huber(u)/u_nonzero, 'g-', label='Huber', linewidth=2)
ax2.plot(u, psi_tukey(u)/u_nonzero, 'r-', label='Tukey', linewidth=2)
ax2.plot(u, psi_hampel(u)/u_nonzero, 'm-', label='Hampel', linewidth=2)
ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax2.set_xlabel('Standardized Residual (u)')
ax2.set_ylabel('w(u) = ψ(u)/u')
ax2.set_title('Weight Functions: Observation Weights in IRLS')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-10, 10)
ax2.set_ylim(-0.5, 1.5)

plt.tight_layout()
plt.show()

print("Key insight: Redescending estimators (Tukey, Hampel)")
print("assign ZERO weight to extreme outliers, completely rejecting them.")
```

Let's examine the most widely used M-estimators in detail, understanding their mathematical form and practical use cases.
1. Tukey's Bisquare (Biweight)
Proposed by John Tukey, the bisquare is the most popular redescending estimator:
$$\rho(u) = \begin{cases} \frac{c^2}{6}\left[1 - \left(1 - \left(\frac{u}{c}\right)^2\right)^3\right] & |u| \leq c \\ \frac{c^2}{6} & |u| > c \end{cases}$$

$$\psi(u) = \begin{cases} u\left(1 - \left(\frac{u}{c}\right)^2\right)^2 & |u| \leq c \\ 0 & |u| > c \end{cases}$$
With $c = 4.685$, Tukey's bisquare achieves 95% efficiency under Gaussian errors.
Properties:
- Smooth (continuously differentiable) ψ with no hard cutoffs
- Zero influence for $|u| > c$: gross outliers are rejected entirely
- 95% Gaussian efficiency at $c = 4.685$; smaller $c$ trades efficiency for more robustness
The name 'bisquare' comes from the (1 - u²)² factor—squaring the (1 - u²) term. This creates smooth descent to zero influence, unlike hard cutoffs that could cause numerical instabilities.
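Differentiating the ρ above does recover the stated ψ; a quick numerical sketch (grid and tolerance are arbitrary choices):

```python
import numpy as np

c = 4.685

def rho_tukey(u):
    inside = np.abs(u) <= c
    return np.where(inside, (c**2 / 6) * (1 - (1 - (u / c)**2)**3), c**2 / 6)

def psi_tukey(u):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.linspace(-8, 8, 801)
eps = 1e-6
psi_numeric = (rho_tukey(u + eps) - rho_tukey(u - eps)) / (2 * eps)

print(np.allclose(psi_numeric, psi_tukey(u), atol=1e-5))  # True: psi = rho'
```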
2. Hampel's Three-Part Redescender
Frank Hampel proposed a more interpretable piecewise-linear ψ-function with three distinct regions:
$$\psi(u) = \begin{cases} u & 0 \leq |u| \leq a \\ a \cdot \text{sign}(u) & a < |u| \leq b \\ a \cdot \text{sign}(u) \cdot \frac{c - |u|}{c - b} & b < |u| \leq c \\ 0 & |u| > c \end{cases}$$
Standard constants: $a = 1.7, b = 3.4, c = 8.5$
Intuition:
- $|u| \leq a$: full influence, as in OLS
- $a < |u| \leq b$: influence capped at a constant, as in Huber
- $b < |u| \leq c$: influence descends linearly toward zero
- $|u| > c$: the observation is rejected outright
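A quick check, using the standard constants, that the three pieces of ψ join continuously at $a$, $b$, and $c$ (a hard jump in ψ would reintroduce the instabilities the smooth descent is meant to avoid):

```python
import numpy as np

a, b, c = 1.7, 3.4, 8.5

def psi_hampel(u):
    abs_u, sign_u = np.abs(u), np.sign(u)
    return np.where(abs_u <= a, u,
           np.where(abs_u <= b, a * sign_u,
           np.where(abs_u <= c, a * sign_u * (c - abs_u) / (c - b), 0.0)))

# The adjacent pieces agree at each breakpoint, so psi is continuous:
eps = 1e-9
for t in (a, b, c):
    assert abs(float(psi_hampel(t - eps)) - float(psi_hampel(t + eps))) < 1e-6
print("psi is continuous at a, b, c; psi(c) =", float(psi_hampel(c)))
```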
3. Andrews' Wave (Sine Wave)
$$\psi(u) = \begin{cases} \sin(u/c) & |u| \leq c\pi \\ 0 & |u| > c\pi \end{cases}$$
Standard constant: $c = 1.339$
Andrews' wave uses a sinusoidal form for an especially smooth transition: influence rises to a peak at $|u| = c\pi/2$, then descends smoothly to zero at $|u| = c\pi$, beyond which observations are fully rejected. Because ψ is cut off at $|u| = c\pi$, the sine never actually oscillates, and in practice the estimator behaves much like Tukey's bisquare.
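A minimal sketch (the helper name `psi_andrews` is ours): influence peaks at $|u| = c\pi/2$ and is exactly zero beyond $|u| = c\pi$:

```python
import numpy as np

c = 1.339  # 95% efficiency under Gaussian errors

def psi_andrews(u):
    return np.where(np.abs(u) <= c * np.pi,
                    np.sin(np.asarray(u, dtype=float) / c), 0.0)

u = np.linspace(-10, 10, 2001)
w = psi_andrews(u)

peak = u[np.argmax(w)]
print(peak)                                  # close to c*pi/2 (about 2.10)
print(float(psi_andrews(c * np.pi + 0.01)))  # 0.0: rejected beyond c*pi
```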
| Estimator | Type | Efficiency (Gaussian) | Breakdown | Use Case |
|---|---|---|---|---|
| OLS | Monotone (unbounded) | 100% | 0% | Clean data, Gaussian errors |
| Huber (k=1.345) | Monotone (bounded) | 95% | ~0% | Mild outliers expected |
| Tukey (c=4.685) | Redescending | 95% | ~50%* | Significant outliers expected |
| Hampel | Redescending | 95% | ~50%* | Need interpretable cutoffs |
| Andrews' Wave | Redescending | 95% | ~50%* | Very smooth transitions needed |
The ~50% breakdown point for redescending estimators is only achieved with appropriate scale estimation and design matrix conditions. S-estimators or MM-estimators are needed to guarantee high breakdown points in regression settings.
Iteratively Reweighted Least Squares (IRLS) provides a unified algorithm for computing any M-estimator, regardless of the specific ρ-function chosen.
The IRLS algorithm:
The M-estimation equations can be written as:
$$\sum_{i=1}^{n} w_i(\boldsymbol{\beta}) \cdot r_i \cdot \mathbf{x}_i = \mathbf{0}$$
where $w_i = \psi(u_i)/u_i$ and $u_i = r_i/\hat{\sigma}$.
This is equivalent to weighted least squares with weights $w_i$—but the weights depend on $\boldsymbol{\beta}$! IRLS resolves this by iterating:
1. Compute residuals $r_i = y_i - \mathbf{x}_i^\top \boldsymbol{\beta}$ from the current estimate
2. Update the robust scale $\hat{\sigma}$ (e.g., the normalized MAD of the residuals)
3. Compute weights $w_i = w(r_i/\hat{\sigma})$
4. Solve the weighted least squares problem to obtain a new $\boldsymbol{\beta}$
5. Repeat until the coefficients converge
```python
import numpy as np
from scipy import linalg

def weight_tukey(u, c=4.685):
    """Tukey bisquare weight function"""
    return np.where(np.abs(u) <= c, (1 - (u/c)**2)**2, 0)

def weight_huber(u, k=1.345):
    """Huber weight function"""
    return np.where(np.abs(u) <= k, 1.0, k / np.abs(u))

def weight_hampel(u, a=1.7, b=3.4, c=8.5):
    """Hampel weight function"""
    abs_u = np.abs(u)
    return np.where(abs_u <= a, 1.0,
           np.where(abs_u <= b, a / abs_u,
           np.where(abs_u <= c, a * (c - abs_u) / (abs_u * (c - b)), 0)))

def m_estimator_irls(X, y, weight_func, max_iter=100, tol=1e-6):
    """
    General M-estimator via IRLS.

    Parameters
    ----------
    X : array (n, p)
        Design matrix with intercept
    y : array (n,)
        Response
    weight_func : callable
        Weight function w(u) = psi(u)/u
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance

    Returns
    -------
    beta : array (p,)
        Estimated coefficients
    info : dict
        Convergence information
    """
    n, p = X.shape

    # For simplicity, use OLS as the starting point; a robust initial
    # estimate (e.g., via L1/LAD) is preferable for redescending estimators.
    beta = linalg.lstsq(X, y)[0]

    info = {'iterations': 0, 'converged': False}

    for iteration in range(max_iter):
        # Residuals
        r = y - X @ beta

        # Robust scale (MAD)
        mad = np.median(np.abs(r - np.median(r)))
        sigma = 1.4826 * mad if mad > 1e-10 else 1.0

        # Standardized residuals
        u = r / sigma

        # Weights
        w = weight_func(u)

        # Handle zero weights (for numerical stability)
        w = np.maximum(w, 1e-10)

        # Weighted least squares
        W = np.diag(w)
        XtWX = X.T @ W @ X
        XtWy = X.T @ W @ y

        # Add small regularization for stability
        beta_new = linalg.solve(XtWX + 1e-10 * np.eye(p), XtWy)

        # Check convergence
        change = np.max(np.abs(beta_new - beta))
        if change < tol:
            info['converged'] = True
            info['iterations'] = iteration + 1
            beta = beta_new
            break

        beta = beta_new

    info['final_weights'] = w
    info['final_scale'] = sigma
    return beta, info

# Example comparison
np.random.seed(42)
n = 100

# Generate data with outliers
X = np.column_stack([np.ones(n), np.random.randn(n)])
true_beta = np.array([2.0, 3.0])
y = X @ true_beta + 0.5 * np.random.randn(n)

# Add 10% outliers
outlier_idx = np.random.choice(n, 10, replace=False)
y[outlier_idx] += np.random.choice([-1, 1], 10) * 15

# Fit various M-estimators
beta_ols = linalg.lstsq(X, y)[0]
beta_huber, _ = m_estimator_irls(X, y, weight_huber)
beta_tukey, info = m_estimator_irls(X, y, weight_tukey)

print("True coefficients:", true_beta)
print(f"OLS:   [{beta_ols[0]:.3f}, {beta_ols[1]:.3f}]")
print(f"Huber: [{beta_huber[0]:.3f}, {beta_huber[1]:.3f}]")
print(f"Tukey: [{beta_tukey[0]:.3f}, {beta_tukey[1]:.3f}]")
print(f"Tukey outlier weights: {info['final_weights'][outlier_idx]}")
```

With many M-estimators available, how should practitioners choose? The decision depends on your data characteristics, robustness requirements, and computational constraints.
Decision framework:
- Clean data, efficiency paramount: OLS
- Mild outliers expected, reliable convergence needed: Huber
- Gross outliers that should be rejected outright: Tukey's bisquare
- Explicit, interpretable cutoffs required: Hampel
For most applications, Huber regression (monotone) or Tukey's bisquare (redescending) covers the vast majority of use cases. Start with Huber for its simplicity and guaranteed convergence; upgrade to Tukey when you need complete outlier rejection.
What's next:
M-estimators provide elegant theory but face challenges when outliers are extreme or when outliers occur in the X-space (leverage points). The next page introduces RANSAC (Random Sample Consensus)—a fundamentally different approach that explicitly models data as inliers plus outliers, achieving remarkable robustness even under extreme contamination.
You now understand M-estimators as the general framework for robust estimation, including the roles of ρ, ψ, and weight functions, the distinction between monotone and redescending estimators, and how to select appropriate estimators for different scenarios.