Real-world data is messy. Outliers lurk in datasets due to measurement errors, data entry mistakes, exceptional circumstances, or heavy-tailed distributions that generate extreme observations naturally. Traditional least squares regression—optimized for the mean—is notoriously sensitive to such violations.
Quantile regression, particularly median regression, offers a fundamentally robust alternative.
The robustness of quantile regression isn't an afterthought or an add-on—it emerges naturally from the loss function's geometry. Where squared loss penalizes large residuals quadratically (making extreme observations disproportionately influential), quantile loss penalizes only linearly, bounding the influence of any single point.
This page explores why quantile regression is robust, how to quantify robustness using influence functions and breakdown points, how quantile regression compares to other robust estimators, and when robustness should drive your methodological choices.
Before appreciating quantile regression's robustness, we must understand what makes OLS fragile.
The OLS Objective:
$$\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^n (y_i - x_i^\top \beta)^2$$
Why Squaring Creates Sensitivity:
Consider two residuals: $r_1 = 1$ and $r_2 = 10$. Under squared loss, they contribute $1$ and $100$ to the objective: the larger residual carries 100 times the weight. Under absolute loss, the contributions are $1$ and $10$: only 10 times the weight.
This means a single extreme observation can dominate the minimization, pulling $\hat{\beta}$ away from the pattern in the majority of the data.
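The residual comparison can be checked in two lines (a minimal sketch using the residual pair from the example):

```python
# Relative contribution of a large vs. small residual under each loss.
r1, r2 = 1.0, 10.0

print(r2**2 / r1**2)      # squared loss: r2 carries 100x the weight of r1
print(abs(r2) / abs(r1))  # absolute loss: r2 carries only 10x the weight
```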
A Striking Example:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, QuantileRegressor

np.random.seed(42)

# Generate clean data
n = 50
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 2 * X.ravel() + 5 + np.random.normal(0, 1, n)

# Add one outlier
X_outlier = np.vstack([X, [[5]]])
y_outlier = np.append(y, [50])  # Massive outlier

# Fit OLS and median regression
ols_clean = LinearRegression().fit(X, y)
ols_outlier = LinearRegression().fit(X_outlier, y_outlier)
qr_outlier = QuantileRegressor(quantile=0.5, alpha=0, solver='highs').fit(X_outlier, y_outlier)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
X_line = np.linspace(0, 10, 100).reshape(-1, 1)

ax1 = axes[0]
ax1.scatter(X, y, alpha=0.7, s=50, label='Clean Data')
ax1.plot(X_line, ols_clean.predict(X_line), 'b-', linewidth=2, label='OLS (Clean)')
ax1.set_xlabel('X')
ax1.set_ylabel('y')
ax1.set_title('Clean Data: OLS Works Well')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2 = axes[1]
ax2.scatter(X, y, alpha=0.7, s=50, label='Clean Data')
ax2.scatter([5], [50], color='red', s=200, marker='*', label='Single Outlier', zorder=5)
ax2.plot(X_line, ols_outlier.predict(X_line), 'r--', linewidth=2, label='OLS (with outlier)')
ax2.plot(X_line, qr_outlier.predict(X_line), 'g-', linewidth=2, label='Median Regression (τ=0.5)')
ax2.plot(X_line, ols_clean.predict(X_line), 'b:', linewidth=2, label='OLS (clean, reference)')
ax2.set_xlabel('X')
ax2.set_ylabel('y')
ax2.set_title('One Outlier Destroys OLS; Median Regression Resists')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(-5, 55)

plt.tight_layout()
plt.show()

# Coefficient comparison
print("Slope Estimates (True = 2.0):")
print(f"  OLS (clean):           {ols_clean.coef_[0]:.3f}")
print(f"  OLS (with outlier):    {ols_outlier.coef_[0]:.3f}")
print(f"  Median (with outlier): {qr_outlier.coef_[0]:.3f}")
```

A single outlier among 50 observations can completely distort the OLS fit. The median regression line remains virtually unchanged.
This is not a minor difference—it's the difference between a useful model and a useless one.
The influence function is a fundamental tool from robust statistics that measures how sensitive an estimator is to infinitesimal contamination at a particular point.
Definition (Influence Function):
For an estimator $T$ at distribution $F$, the influence function at point $z$ is:
$$\text{IF}(z; T, F) = \lim_{\varepsilon \to 0} \frac{T((1-\varepsilon)F + \varepsilon \delta_z) - T(F)}{\varepsilon}$$
where $\delta_z$ is a point mass at $z$.
Interpretation: IF$(z)$ measures how the estimator changes when an infinitesimal fraction of data comes from a point mass at $z$. If IF$(z)$ grows unboundedly as $z \to \infty$, the estimator is sensitive to extreme observations.
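The limit in the definition has a finite-sample analogue, the sensitivity curve $SC(z) = n\,\big(T(x_1,\dots,x_{n-1},z) - T(x_1,\dots,x_{n-1})\big)$, which we can probe numerically (a sketch; the sample and the contamination points $z$ are arbitrary, and `sensitivity` is an illustrative helper):

```python
import numpy as np

# Sensitivity curve: a finite-sample analogue of the influence function.
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 999)

def sensitivity(T, z):
    """SC(z) = n * (T(x_1,...,x_{n-1}, z) - T(x_1,...,x_{n-1}))."""
    n = len(x) + 1
    return n * (T(np.append(x, z)) - T(x))

for z in [2.0, 20.0, 200.0]:
    print(f"z={z:6.1f}  SC(mean)={sensitivity(np.mean, z):8.2f}  "
          f"SC(median)={sensitivity(np.median, z):6.3f}")
```

The mean's curve grows linearly with $z$, while the median's curve is identical at $z = 20$ and $z = 200$: once the contamination point clears the sample, pushing it further changes nothing.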
Influence Functions for Key Estimators:
| Estimator | Influence Function IF(y) | Boundedness |
|---|---|---|
| Sample Mean | y − μ | Unbounded: IF → ±∞ as y → ±∞ |
| Sample Median | sign(y − μ) / (2f(μ)) | Bounded: IF ∈ {−c, +c} |
| OLS slope | Proportional to x(y − xβ) | Unbounded in both x and y |
| Quantile Regression | Bounded transformation | Bounded in y, depends on x distribution |
The Mean's Unbounded Influence:
For the sample mean: $$\text{IF}(y) = y - \mu$$
As $y \to \infty$, IF$(y) \to \infty$. A single extreme observation has arbitrarily large influence.
The Median's Bounded Influence:
For the sample median with symmetric density $f$: $$\text{IF}(y) = \frac{\text{sign}(y - \mu)}{2f(\mu)}$$
This is bounded! No matter how extreme $y$ is, it contributes the same amount to moving the median. Once an observation falls above (or below) the median, moving it further has no additional effect.
Quantile Regression's Bounded Influence:
For quantile regression at level $\tau$, the influence function in the $y$-direction is bounded. The loss function $\rho_\tau(y - x^\top \beta)$ grows only linearly in $|y|$, leading to:
$$\frac{\partial}{\partial y} \rho_\tau(y - x^\top \beta) = \tau - \mathbb{1}\{y < x^\top \beta\}$$
This derivative is bounded in $[\tau - 1, \tau] \subset [-1, 1]$, ensuring stable estimates.
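The boundedness is easy to verify numerically (a sketch; `pinball_grad` is an illustrative helper implementing the derivative above, taking the value $\tau - 1$ at the kink):

```python
import numpy as np

def pinball_grad(u, tau):
    """Subgradient of the check loss rho_tau at residual u = y - x'beta:
    tau where u > 0, tau - 1 where u <= 0."""
    return np.where(u > 0, tau, tau - 1.0)

u = np.array([-1e9, -1.0, 1.0, 1e9])  # even astronomically large residuals
for tau in [0.1, 0.5, 0.9]:
    g = pinball_grad(u, tau)
    print(tau, g, bool(np.all(np.abs(g) <= 1)))
```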
A bounded influence function is the hallmark of a robust estimator. It guarantees that no single observation—no matter how extreme—can arbitrarily distort the estimate. Quantile regression achieves this naturally through its linear (rather than quadratic) loss.
While the influence function measures sensitivity to infinitesimal contamination, the breakdown point measures resistance to gross contamination.
Definition (Finite-Sample Breakdown Point):
The breakdown point is the maximum fraction of data that can be arbitrarily corrupted while the estimator remains bounded:
$$\varepsilon^* = \min\left\{\frac{m}{n} : \sup_{|y'_1|, \ldots, |y'_m| \to \infty} \left|T(y_1, \ldots, y_{n-m}, y'_1, \ldots, y'_m)\right| = \infty\right\}$$
Breakdown Points for Common Estimators:
| Estimator | Breakdown Point | Interpretation |
|---|---|---|
| Sample Mean | ε* = 0 (or 1/n) | A single outlier can corrupt the mean arbitrarily |
| Sample Median | ε* = 0.5 | Up to ~50% of data can be outliers before breakdown |
| OLS Regression | ε* = 1/n | One high-leverage outlier can destroy the fit |
| Median Regression (τ=0.5) | ε* ≈ 0.5/(p+1) | High breakdown, decreasing with the number of predictors |
| General Quantile (τ) | ε* ≈ min(τ, 1-τ) | Depends on τ; most robust at τ=0.5 |
| Trimmed Mean (10%) | ε* = 0.1 | 10% of extremes in each tail can be corrupted |
| LMS/LTS Regression | ε* ≈ 0.5 | High breakdown, but computationally expensive |
Why 50% is the Maximum:
No reasonable estimator can have breakdown point above 50%. If more than half the data is corrupted, the "outliers" become the majority, and distinguishing signal from contamination becomes impossible.
Median Regression's High Breakdown:
Median regression (quantile regression with $\tau = 0.5$) achieves near-optimal breakdown:
$$\varepsilon^* \approx \frac{1}{2} \cdot \frac{1}{p+1}$$
where $p$ is the number of predictors. For univariate regression ($p=1$), this gives $\varepsilon^* \approx 0.25$, meaning up to ~25% of data can be arbitrarily corrupted.
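The breakdown behavior is easiest to watch in the univariate case (a sketch with arbitrary corruption values): corrupt a growing fraction of a sample and see when each estimator breaks.

```python
import numpy as np

# Corrupt m of 100 observations and watch when each estimator breaks.
rng = np.random.default_rng(1)
clean = rng.normal(0, 1, 100)

for m in [1, 10, 49, 51]:  # number of corrupted observations out of 100
    corrupted = clean.copy()
    corrupted[:m] = 1e9  # arbitrarily bad values
    print(f"{m:>2}% corrupted: mean={corrupted.mean():10.3g}  "
          f"median={np.median(corrupted):10.3g}")
```

A single bad value destroys the mean; the median survives 49% corruption and only breaks once the corrupted points become the majority.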
Extreme Quantiles Are Less Robust:
For $\tau = 0.1$ (10th percentile), approximately $\min(0.1, 0.9) = 0.1$ fraction can be corrupted. Extreme quantiles estimate tail behavior, which requires extreme observations to be genuine—making robustness to tail contamination limited.
Extreme quantile estimation inherently requires observations from the tails. If tail observations are outliers, they contain genuine information about the tail distribution. If they are errors, they corrupt the estimate. Distinguishing these cases is fundamentally difficult—hence the lower breakdown point for extreme τ.
Quantile regression is one member of a broader family of robust regression methods. Understanding the alternatives clarifies when each is appropriate.
M-Estimators (Huber, Tukey):
M-estimators generalize maximum likelihood by solving: $$\hat{\beta} = \arg\min_\beta \sum_{i=1}^n \rho(y_i - x_i^\top \beta)$$
for various $\rho$ functions:
Huber loss: Quadratic near zero, linear in tails. Robust to y-outliers but sensitive to x-outliers (leverage points).
Tukey's biweight: Completely downweights extreme residuals (redescending). Very robust but can be multimodal.
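Both $\rho$ functions can be sketched in a few lines (a minimal sketch; the `c` values are the commonly used 95%-efficiency tuning constants):

```python
import numpy as np

def huber_rho(u, c=1.345):
    """Huber loss: quadratic for |u| <= c, linear beyond."""
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def tukey_rho(u, c=4.685):
    """Tukey biweight: increasing up to |u| = c, then completely flat."""
    a = np.minimum(np.abs(u) / c, 1.0)
    return (c**2 / 6.0) * (1.0 - (1.0 - a**2)**3)

u = np.array([0.5, 2.0, 10.0, 100.0])
print(huber_rho(u))  # keeps growing with |u|, but only linearly
print(tukey_rho(u))  # saturates at c**2 / 6 once |u| exceeds c
```

The flat tail of the biweight is what "completely downweights" extreme residuals; it is also why its objective can have multiple local minima.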
Least Median of Squares (LMS) / Least Trimmed Squares (LTS):
Both achieve 50% breakdown but are computationally expensive (combinatorial search).
| Method | Breakdown Point | Y-Outliers | X-Outliers | Computation | Efficiency |
|---|---|---|---|---|---|
| OLS | 0 | Sensitive | Sensitive | O(np²) | 100% (Gaussian) |
| Median Reg. (τ=0.5) | ~50%/(p+1) | Robust | Moderate | O(n log n) | 64% (Gaussian) |
| Huber M-estimator | ~0 | Robust | Sensitive | O(np²) | 95% (Gaussian) |
| Tukey M-estimator | ~0 | Very robust | Sensitive | O(np²) | Variable |
| LMS/LTS | ~50% | Very robust | Robust | Exponential | Low |
| MM-estimator | ~50% | Very robust | Robust | Iterative | High |
Use quantile regression when: (1) you want interpretable quantile estimates, (2) heteroscedasticity is present, or (3) you need distribution-free inference. Use Huber M-estimation for near-Gaussian data with occasional outliers. Use MM-estimators when high breakdown and high efficiency are both critical.
Robustness isn't free. There's a fundamental trade-off between robust estimators and efficient ones.
Definition (Relative Efficiency):
The asymptotic relative efficiency (ARE) of estimator $T_1$ relative to $T_2$ is:
$$\text{ARE}(T_1, T_2) = \frac{\text{Var}(T_2)}{\text{Var}(T_1)}$$
Higher ARE means $T_1$ is more efficient (lower variance).
Efficiency of Median Regression:
Under Gaussian errors, median regression has: $$\text{ARE}(\text{Median}, \text{Mean}) = \frac{2}{\pi} \approx 0.637$$
Median regression is 64% as efficient as OLS when errors are truly Gaussian. You need approximately 1/0.637 ≈ 1.57× more data to achieve the same precision.
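The $2/\pi$ figure is easy to verify by simulation (a sketch; the sample size and number of replications are arbitrary):

```python
import numpy as np

# Monte Carlo check of ARE(median, mean) under Gaussian errors.
rng = np.random.default_rng(42)
n, reps = 200, 20_000
samples = rng.normal(0, 1, (reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
print(f"Estimated ARE: {var_mean / var_median:.3f}  (theory 2/pi = {2 / np.pi:.3f})")
```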
But Under Heavy Tails:
For heavier-tailed distributions (Laplace, t-distribution), the efficiency comparison reverses:
| Distribution | ARE(Median, Mean) | Interpretation |
|---|---|---|
| Normal (Gaussian) | 0.637 (64%) | Mean is 1.57× more efficient |
| Laplace (double exponential) | 2.0 (200%) | Median is 2× more efficient |
| t-distribution (df=5) | 0.96 (96%) | Approximately equal |
| t-distribution (df=3) | 1.62 (162%) | Median clearly more efficient |
| t-distribution (df=2) | ∞ | Mean variance is infinite! |
| Cauchy | undefined/∞ | Mean doesn't exist; median still works |
| Contaminated Normal (10%) | > 1 (depends on k) | Median more efficient once outliers are present |
The Key Insight:
If errors are Gaussian: OLS is more efficient, and the efficiency loss from quantile regression is the price of unused robustness.
If errors are heavy-tailed: Quantile regression can be MORE efficient than OLS, plus it's robust.
If you're unsure: The robustness insurance of quantile regression may be worth the mild efficiency loss under Gaussianity.
Gross Error Sensitivity:
In practice, data rarely follows perfect Gaussian assumptions. A more realistic model is the contaminated normal:
$$Y \sim (1-\varepsilon) \cdot N(\mu, \sigma^2) + \varepsilon \cdot N(\mu, k^2\sigma^2)$$
where $\varepsilon$ fraction of observations come from a high-variance component. With even 5% contamination, robust estimators often outperform OLS.
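Simulating this model shows the reversal directly (a sketch; $\varepsilon = 0.1$ and $k = 5$ are chosen for illustration):

```python
import numpy as np

# Sampling variability of mean vs. median under 10% contamination.
rng = np.random.default_rng(7)
n, reps, eps, k = 200, 10_000, 0.10, 5.0

samples = rng.normal(0, 1, (reps, n))
mask = rng.random((reps, n)) < eps          # flag the contaminated entries
samples[mask] = rng.normal(0, k, int(mask.sum()))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
print(f"Var(mean)   = {var_mean:.5f}")
print(f"Var(median) = {var_median:.5f}")
print(f"Empirical ARE(median, mean) = {var_mean / var_median:.2f}")
```

Under this contamination the median's sampling variance is clearly smaller than the mean's, so the empirical ARE exceeds one.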
Think of robustness as insurance. Under ideal conditions (Gaussian errors), you pay a premium (64% efficiency). Under adverse conditions (outliers, heavy tails), the insurance pays off—your estimates remain valid while OLS collapses.
An important nuance: quantile regression is robust to y-direction outliers but has only moderate resistance to x-direction outliers (leverage points).
Definitions:
A vertical outlier is an observation whose $y$ value is extreme but whose $x$ value lies within the bulk of the data. A leverage point is an observation whose $x$ value is far from the rest of the predictor values; a "bad" leverage point additionally has a $y$ value that breaks the overall pattern.
Why Leverage Points Are Problematic:
In regression, observations with extreme $x$ values have more influence on the slope because they have longer "lever arms." This is true for both OLS and quantile regression.
Geometric Intuition:
Imagine fitting a line to data. A point far from the center of the x-distribution acts like a pivot point—moving its y-value rotates the line substantially. Points near the center of x affect the line much less.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, QuantileRegressor

np.random.seed(42)

# Generate data with X in [0, 5]
n = 50
X = np.random.uniform(0, 5, n).reshape(-1, 1)
y = 2 * X.ravel() + 5 + np.random.normal(0, 1, n)

# Scenario 1: Vertical outlier near center of X
X_v = np.vstack([X, [[2.5]]])  # x near center
y_v = np.append(y, [50])       # extreme y

# Scenario 2: High-leverage outlier (extreme x AND y)
X_l = np.vstack([X, [[10]]])  # x far from data
y_l = np.append(y, [5])       # y not following pattern

# Fit models
ols_clean = LinearRegression().fit(X, y)
ols_v = LinearRegression().fit(X_v, y_v)
ols_l = LinearRegression().fit(X_l, y_l)
qr_v = QuantileRegressor(quantile=0.5, alpha=0, solver='highs').fit(X_v, y_v)
qr_l = QuantileRegressor(quantile=0.5, alpha=0, solver='highs').fit(X_l, y_l)

# Plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
X_line = np.linspace(-1, 12, 100).reshape(-1, 1)

ax1 = axes[0]
ax1.scatter(X, y, alpha=0.7, s=50)
ax1.scatter([2.5], [50], color='red', s=200, marker='*', label='Vertical Outlier', zorder=5)
ax1.plot(X_line, ols_clean.predict(X_line), 'b:', linewidth=2, label='OLS (clean)')
ax1.plot(X_line, ols_v.predict(X_line), 'r--', linewidth=2, label='OLS (with outlier)')
ax1.plot(X_line, qr_v.predict(X_line), 'g-', linewidth=2, label='Median Reg.')
ax1.set_xlabel('X')
ax1.set_ylabel('y')
ax1.set_title('Vertical Outlier: Median Regression Resists')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-1, 12)
ax1.set_ylim(-5, 55)

ax2 = axes[1]
ax2.scatter(X, y, alpha=0.7, s=50)
ax2.scatter([10], [5], color='orange', s=200, marker='*', label='Leverage Point', zorder=5)
ax2.plot(X_line, ols_clean.predict(X_line), 'b:', linewidth=2, label='OLS (clean)')
ax2.plot(X_line, ols_l.predict(X_line), 'r--', linewidth=2, label='OLS (with outlier)')
ax2.plot(X_line, qr_l.predict(X_line), 'g-', linewidth=2, label='Median Reg.')
ax2.set_xlabel('X')
ax2.set_ylabel('y')
ax2.set_title('Leverage Point: Both Methods Affected!')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-1, 12)
ax2.set_ylim(-5, 55)

plt.tight_layout()
plt.show()

print("Slope Estimates (True = 2.0):")
print(f"  Clean OLS:            {ols_clean.coef_[0]:.3f}")
print(f"  Vertical outlier OLS: {ols_v.coef_[0]:.3f}")
print(f"  Vertical outlier QR:  {qr_v.coef_[0]:.3f}")
print(f"  Leverage point OLS:   {ols_l.coef_[0]:.3f}")
print(f"  Leverage point QR:    {qr_l.coef_[0]:.3f}")
```

Quantile regression provides excellent protection against vertical outliers but only moderate protection against high-leverage points. For complete robustness, consider: (1) detecting and examining leverage points, (2) using bounded-influence regression, or (3) applying robust distance measures in the x-space.
Not every analysis needs robust methods. Here's guidance on when robustness should be a priority.
A Practical Decision Framework:
1. Always examine residuals before deciding. Large residuals suggest potential outliers.
2. Compare OLS and median regression. If the estimates differ substantially, outliers are influencing OLS.
3. Assess downstream impact. If the analysis drives important decisions, robustness is worth the efficiency cost.
4. Consider the worst case. What happens if there are undetected outliers? Robust methods provide insurance.
5. Report both when uncertain. Presenting OLS and quantile regression side by side illuminates sensitivity.
In practice, the efficiency loss from using robust methods on clean data is modest (64% for median regression). The protection against undetected outliers is substantial. When in doubt, robust methods provide better risk-adjusted performance.
We have explored the robustness properties of quantile regression—an essential advantage for real-world applications.
Module Conclusion: Quantile Regression Mastery
Over these five pages, you have built a comprehensive understanding of quantile regression. It complements mean regression by modeling the entire conditional distribution rather than only its center, by providing estimates that are robust to outliers and heavy tails, and by enabling uncertainty quantification without strong distributional assumptions.
Congratulations! You have mastered quantile regression—a powerful technique that models the entire conditional distribution, provides robust estimates, and enables principled uncertainty quantification. This knowledge extends your regression toolkit far beyond mean-focused methods, preparing you for the complexities of real-world data analysis.