Every regression technique you've encountered thus far—from simple linear regression to sophisticated kernel methods—shares a common objective: estimating the conditional mean. When you predict house prices or stock returns, you're implicitly asking, "What is the average value of y given x?"
But this singular focus on the mean tells only part of the story. In many real-world scenarios, knowing the average is insufficient or even misleading: the distribution may be heavily skewed, the spread of the response may change with the covariates, or the decision at hand may hinge on the tails rather than the center.
Quantile regression provides the mathematical framework to address these questions directly, allowing us to model any quantile of the conditional distribution—not just the mean.
By the end of this page, you will understand the quantile loss function—the foundation of quantile regression. You'll see how this elegantly asymmetric loss function enables estimation of any desired quantile, why it reduces to familiar cases at special values, and how it fundamentally differs from squared error loss.
To appreciate quantile regression, we must first understand what we sacrifice when focusing exclusively on conditional means.
The Standard Regression Setup:
Consider the familiar regression model:
$$y_i = f(x_i) + \varepsilon_i$$
where $y_i$ is the response, $x_i$ are features, $f(x_i)$ is the conditional mean function, and $\varepsilon_i$ is zero-mean noise. Ordinary Least Squares (OLS) estimates $f$ by minimizing:
$$\hat{f}_{\text{OLS}} = \arg\min_f \sum_{i=1}^{n} (y_i - f(x_i))^2$$
This estimator has a well-known property: the squared-error objective is minimized, in the population, by the conditional mean $\mathbb{E}[Y \mid X = x]$.
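To make this concrete, here is a minimal numerical sketch (the exponential sample and grid search are illustrative assumptions, not part of the original example) showing that the constant minimizing the average squared loss over a sample is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5_000)  # a right-skewed sample (illustrative)

# Scan candidate constants c and evaluate the average squared loss for each
candidates = np.linspace(y.min(), y.max(), 10_000)
mse = np.array([np.mean((y - c) ** 2) for c in candidates])

best_c = candidates[np.argmin(mse)]
print(f"argmin of mean squared loss: {best_c:.4f}")
print(f"sample mean:                 {y.mean():.4f}")  # agrees up to grid resolution
```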
Why the mean can be inadequate:
The conditional mean is pulled around by extreme values, it says nothing about the spread or the tails of the response, and in skewed distributions it can sit far from what a typical observation looks like.
A Motivating Example:
Consider predicting medical costs based on patient characteristics. The distribution of healthcare expenditures is notoriously right-skewed: most patients incur modest costs, but a small fraction generates extremely high bills. The conditional mean—influenced heavily by these expensive cases—may not represent what most patients will actually pay.
If an insurer sets premiums based only on the mean, it might overcharge the typical low-cost patient while still underestimating its exposure to the small fraction of extremely expensive cases.
Quantile regression provides a richer picture by estimating the entire conditional distribution.
Quantile regression doesn't replace mean regression—it complements it. When distributional effects vary across different parts of the outcome distribution, or when robustness to outliers is paramount, quantile regression becomes an essential tool.
Before constructing the quantile loss function, let's establish rigorous definitions and properties of quantiles themselves.
Definition (Quantile):
For a random variable $Y$ with cumulative distribution function (CDF) $F_Y(y) = P(Y \leq y)$, the $\tau$-th quantile $q_\tau$ is defined as:
$$q_\tau = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
for $\tau \in (0, 1)$.
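As a quick illustration, here is a minimal sketch (the normal sample is an assumed example) of the empirical analogue of this definition, which picks the smallest observed value whose empirical CDF reaches $\tau$:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(size=1_000))  # illustrative sample
tau = 0.9
n = len(sample)

# The empirical CDF at the i-th sorted value is (i + 1) / n, so
# inf{y : F_n(y) >= tau} is the first sorted value whose rank reaches tau * n.
k = int(np.ceil(tau * n)) - 1   # zero-based index
q_tau_empirical = sample[k]

print(f"empirical 0.9-quantile (inf definition): {q_tau_empirical:.4f}")
# np.quantile interpolates by default, so the two values can differ slightly
print(f"np.quantile for comparison:              {np.quantile(sample, tau):.4f}")
```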
Key Properties of Quantiles:
- Quantiles always exist, even for heavy-tailed distributions whose mean is undefined.
- The quantile function $\tau \mapsto q_\tau$ is non-decreasing in $\tau$.
- Quantiles are equivariant under monotone transformations: for a non-decreasing function $g$, the $\tau$-th quantile of $g(Y)$ is $g(q_\tau)$.
Common Quantiles and Their Names:
| Quantile Level τ | Name | Interpretation |
|---|---|---|
| τ = 0.5 | Median | The middle value; 50% of observations lie below |
| τ = 0.25, 0.75 | Quartiles (Q1, Q3) | Divide data into four equal parts |
| τ = 0.1, 0.9 | Deciles (D1, D9) | Mark 10% and 90% thresholds |
| τ = 0.01, 0.99 | Percentiles (P1, P99) | Extreme quantiles for tail analysis |
| τ = 0.05 | Lower 5th Percentile | Common in risk assessment (VaR) |
Conditional Quantiles:
In regression, we're interested in conditional quantiles—the quantiles of $Y$ given $X = x$:
$$Q_\tau(Y \mid X = x) = \inf\{y : P(Y \leq y \mid X = x) \geq \tau\}$$
For the conditional mean, we have $\mathbb{E}[Y \mid X = x]$. Quantile regression generalizes this to:
$$Q_\tau(Y \mid X = x) = x^\top \beta(\tau)$$
where $\beta(\tau)$ is the coefficient vector for quantile $\tau$. Note that the coefficients themselves depend on $\tau$—different quantiles have different regression coefficients!
Unlike OLS where a single β describes the entire conditional mean, quantile regression produces a family of coefficient vectors β(τ)—one for each quantile τ ∈ (0,1). This reveals how the effect of covariates varies across the distribution of the response.
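As a hedged sketch of what fitting a single conditional quantile can look like in practice, the snippet below uses statsmodels' QuantReg on synthetic heteroscedastic data; the data-generating process is an illustrative assumption, not something specified in the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2_000
x = rng.uniform(0, 10, size=n)
# The spread of y grows with x, so different quantiles have different slopes
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x, size=n)

X = sm.add_constant(x)              # [1, x] design matrix
res = sm.QuantReg(y, X).fit(q=0.9)  # estimate beta(0.9)
print(res.params)                   # intercept and slope of the 0.9 conditional quantile line
```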
The choice of loss function determines what aspect of the conditional distribution we estimate. OLS uses squared loss, which minimizes to the mean. What loss function yields quantiles?
The Key Insight:
To estimate the $\tau$-th quantile, we need a loss function that penalizes underestimation and overestimation asymmetrically, with the relative weights chosen so that the minimizer of the expected loss lands exactly at the $\tau$-th quantile rather than at the mean.
The Check Function (Quantile Loss):
The quantile loss function (also called the check function or tilted absolute value function) for quantile level $\tau \in (0, 1)$ is defined as:
$$\rho_\tau(u) = u \cdot (\tau - \mathbb{1}\{u < 0\})$$
where $\mathbb{1}\{u < 0\}$ is the indicator function that equals 1 if $u < 0$ and 0 otherwise.
Equivalent piecewise formulation:
$$\rho_\tau(u) = \begin{cases} \tau \cdot u & \text{if } u \geq 0 \\ (\tau - 1) \cdot u & \text{if } u < 0 \end{cases}$$
Since $\tau - 1$ is negative and $u < 0$ in the second case, this makes $(\tau - 1) \cdot u$ positive, as required for a loss function.
Understanding the Asymmetry:
Let's analyze how the loss penalizes positive errors (underestimation: $y > \hat{y}$, so $u = y - \hat{y} > 0$) versus negative errors (overestimation: $y < \hat{y}$, so $u < 0$):
For $\tau = 0.9$ (estimating the 90th percentile), a positive residual of size $|u|$ (underestimation) incurs loss $0.9|u|$, while a negative residual of the same size (overestimation) incurs only $0.1|u|$.
The optimizer is pushed to make underestimation errors rare because they are penalized 9× more heavily than overestimation. This drives the estimate toward the 90th percentile—a value exceeded by only 10% of observations.
| τ | Underestimation Penalty (u > 0) | Overestimation Penalty (u < 0) | Effect |
|---|---|---|---|
| 0.10 | 0.10 × \|u\| | 0.90 × \|u\| | Strongly penalizes overestimation; estimates lower quantile |
| 0.25 | 0.25 × \|u\| | 0.75 × \|u\| | Penalizes overestimation more; estimates Q1 |
| 0.50 | 0.50 × \|u\| | 0.50 × \|u\| | Symmetric penalty → estimates median |
| 0.75 | 0.75 × \|u\| | 0.25 × \|u\| | Penalizes underestimation more; estimates Q3 |
| 0.90 | 0.90 × \|u\| | 0.10 × \|u\| | Strongly penalizes underestimation; estimates upper quantile |
When τ = 0.5, both types of errors are penalized equally, and the quantile loss reduces to half the absolute error: ρ₀.₅(u) = 0.5|u|. Minimizing this yields the conditional median—which is why median regression is also known as Least Absolute Deviations (LAD) regression.
The quantile loss function possesses several mathematical properties that make it suitable for optimization and statistical inference.
Property 1: Convexity
The quantile loss $\rho_\tau(u)$ is a convex function of $u$. This is easily verified: the function is piecewise linear with slope $\tau - 1 < 0$ for $u < 0$ and slope $\tau > 0$ for $u \geq 0$, and since the slope never decreases as $u$ increases, the function is convex.
Implication: The quantile regression objective function is convex in the parameters, guaranteeing a unique global minimum (or a convex set of minimizers).
Property 2: Population Minimizer
The $\tau$-th population quantile minimizes the expected quantile loss. Formally:
$$q_\tau = \arg\min_q \mathbb{E}[\rho_\tau(Y - q)]$$
Proof Sketch:
Take the derivative (in the distributional sense) of $\mathbb{E}[\rho_\tau(Y - q)]$ with respect to $q$:
$$\frac{d}{dq} \mathbb{E}[\rho_\tau(Y - q)] = -\tau P(Y > q) + (1 - \tau) P(Y < q)$$
$$= -\tau(1 - F(q)) + (1 - \tau)F(q) = F(q) - \tau$$
Setting this to zero gives $F(q) = \tau$, hence $q = F^{-1}(\tau) = q_\tau$. □
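A small numerical sanity check of this property (the log-normal sample and grid scan are illustrative assumptions): minimizing the empirical average of $\rho_\tau(Y - q)$ over a grid of candidate values $q$ should recover the sample $\tau$-quantile.

```python
import numpy as np

def rho(u, tau):
    """Quantile (check) loss."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

rng = np.random.default_rng(7)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # right-skewed sample (illustrative)

tau = 0.75
candidates = np.linspace(y.min(), y.max(), 5_000)
avg_loss = np.array([rho(y - q, tau).mean() for q in candidates])

q_hat = candidates[np.argmin(avg_loss)]
print(f"argmin of average check loss: {q_hat:.4f}")
print(f"sample 0.75-quantile:         {np.quantile(y, tau):.4f}")  # nearly coincide
```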
Property 3: Subgradient Structure
At $u = 0$, the quantile loss is not differentiable. However, it has a well-defined subgradient:
$$\partial \rho_\tau(u) = \begin{cases} \{\tau\} & \text{if } u > 0 \\ [\tau - 1, \tau] & \text{if } u = 0 \\ \{\tau - 1\} & \text{if } u < 0 \end{cases}$$
The subgradient at zero is the interval $[\tau - 1, \tau]$, which always contains zero when $\tau \in (0, 1)$. This structure is exploited by optimization algorithms.
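The kink can be seen numerically with one-sided difference quotients at zero, which approach the two endpoints of the subgradient interval (a minimal sketch; $\tau = 0.7$ is an arbitrary choice):

```python
def rho(u, tau):
    # scalar check function: u * (tau - 1{u < 0})
    return u * (tau - float(u < 0))

tau = 0.7
h = 1e-8
right_slope = (rho(+h, tau) - rho(0.0, tau)) / h   # approaches tau
left_slope  = (rho(0.0, tau) - rho(-h, tau)) / h   # approaches tau - 1
print(f"right derivative at 0: {right_slope:.6f} (tau = {tau})")
print(f"left derivative at 0:  {left_slope:.6f} (tau - 1 = {tau - 1})")
# Any value in [tau - 1, tau] is a valid subgradient at u = 0.
```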
Property 4: Robustness
Unlike squared loss, which grows quadratically with $|u|$, the quantile loss grows only linearly. This bounds the influence any single observation can exert, providing resistance to outliers in the response and more stable estimates for heavy-tailed data.
The linear growth of quantile loss gives it a bounded influence function—extreme observations cannot arbitrarily distort estimates. This connects quantile regression to the broader field of robust statistics, where resistance to outliers is a primary design goal.
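A tiny demonstration of this robustness contrast (the clean sample and the single injected outlier are illustrative assumptions): the mean is dragged far away by one gross error, while the median, the minimizer of the symmetric $\tau = 0.5$ loss, barely moves.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=10.0, scale=1.0, size=100)

y_contaminated = y.copy()
y_contaminated[0] = 1_000.0   # one gross outlier

print(f"mean:   clean = {y.mean():7.3f}, contaminated = {y_contaminated.mean():7.3f}")
print(f"median: clean = {np.median(y):7.3f}, contaminated = {np.median(y_contaminated):7.3f}")
# The mean jumps far from 10 because of a single point; the median is barely affected.
```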
Understanding the differences between quantile loss and squared loss illuminates when each is appropriate.
The Two Loss Functions Side by Side:
$$L_{\text{squared}}(u) = u^2 \qquad \text{vs.} \qquad \rho_\tau(u) = u(\tau - \mathbb{1}\{u < 0\})$$
Geometric Interpretation:
The squared loss creates a symmetric bowl that pushes the estimate toward the center of the residual distribution. The quantile loss creates an asymmetric 'V' that tilts toward one side, pushing the estimate toward a specific quantile.
When to Choose Each:
Use Squared Loss (OLS) when the conditional mean is the quantity of interest, the errors are roughly symmetric and light-tailed, and closed-form solutions and classical inference are valuable.
Use Quantile Loss when the response is skewed or outlier-prone, when over- and under-prediction carry different costs, or when you care about specific parts of the conditional distribution (the median, the tails, prediction intervals) rather than the average.
Unlike squared loss, quantile loss lacks a continuous derivative at zero. This prevents closed-form solutions and requires specialized optimization algorithms—typically linear programming or subgradient methods. We'll address these computational aspects in later pages.
Visual understanding of the quantile loss function solidifies these concepts.
```python
import numpy as np
import matplotlib.pyplot as plt

def quantile_loss(u, tau):
    """
    Compute the quantile loss (check function) for residual u
    and quantile level tau.

    Parameters:
    -----------
    u : array-like
        Residuals (y - y_hat)
    tau : float
        Quantile level in (0, 1)

    Returns:
    --------
    array-like
        Quantile loss values
    """
    u = np.asarray(u, dtype=float)  # accept scalars as well as arrays
    return u * (tau - (u < 0).astype(float))

# Alternative equivalent formulation
def quantile_loss_explicit(u, tau):
    """Explicit piecewise formulation for clarity."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

# Generate residual values
u = np.linspace(-3, 3, 1000)

# Compute losses for different quantile levels
tau_values = [0.1, 0.25, 0.5, 0.75, 0.9]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Quantile loss for different tau values
ax1 = axes[0]
for tau in tau_values:
    loss = quantile_loss(u, tau)
    ax1.plot(u, loss, label=f'τ = {tau}', linewidth=2)

ax1.set_xlabel('Residual u = y - ŷ', fontsize=12)
ax1.set_ylabel('Loss ρ_τ(u)', fontsize=12)
ax1.set_title('Quantile Loss Functions for Different τ', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.axhline(y=0, color='k', linewidth=0.5)
ax1.axvline(x=0, color='k', linewidth=0.5)

# Plot 2: Compare squared loss vs quantile loss (τ = 0.5)
ax2 = axes[1]
squared_loss = u ** 2
median_loss = quantile_loss(u, 0.5)
absolute_loss = np.abs(u)

ax2.plot(u, squared_loss, label='Squared Loss (L₂)', linewidth=2)
ax2.plot(u, absolute_loss, label='Absolute Loss (L₁)', linewidth=2, linestyle='--')
ax2.plot(u, 2 * median_loss, label='Quantile Loss (τ=0.5) × 2', linewidth=2, linestyle=':')

ax2.set_xlabel('Residual u', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('Squared vs Absolute (Quantile τ=0.5) Loss', fontsize=14)
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 5)

plt.tight_layout()
plt.show()

# Demonstrate asymmetry numerically
print("\nAsymmetric Penalties for τ = 0.9:")
print("-" * 50)
for residual in [-2, -1, -0.5, 0.5, 1, 2]:
    loss = float(quantile_loss(residual, 0.9))
    error_type = "Overestimate" if residual < 0 else "Underestimate"
    print(f"Residual u = {residual:+.1f}: Loss = {loss:.2f} ({error_type})")
```

Key Observations from the Visualization:
Asymmetry increases as τ departs from 0.5: For τ = 0.9, the left branch is nearly flat while the right branch is steep.
All curves pass through the origin: Zero residual incurs zero loss, regardless of τ.
Median loss equals half the absolute error: The τ = 0.5 curve matches |u|/2.
Mirror symmetry: ρ_τ(u) and ρ_{1-τ}(-u) are identical—estimating the τ-quantile with one orientation is equivalent to estimating the (1-τ)-quantile with reversed sign.
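The mirror symmetry in the last observation is easy to verify numerically (a minimal sketch; the helper is re-declared so the snippet stands alone):

```python
import numpy as np

def quantile_loss(u, tau):
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

u = np.linspace(-3, 3, 7)
tau = 0.8
# rho_tau(u) equals rho_{1-tau}(-u) for every residual u
print(np.allclose(quantile_loss(u, tau), quantile_loss(-u, 1 - tau)))  # True
```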
With the quantile loss function established, we can now formulate the quantile regression optimization problem.
Problem Statement:
Given data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, the $\tau$-quantile regression estimator is:
$$\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \beta)$$
Expanding the loss function:
$$\hat{\beta}(\tau) = \arg\min_{\beta} \left[ \tau \sum_{i: y_i \geq x_i^\top \beta} |y_i - x_i^\top \beta| + (1 - \tau) \sum_{i: y_i < x_i^\top \beta} |y_i - x_i^\top \beta| \right]$$
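One way to see that this objective is computationally tractable is to recast it as a linear program by splitting each residual into its positive and negative parts. The sketch below does this with scipy.optimize.linprog; the formulation is the standard one alluded to earlier, but the synthetic data and the helper name quantile_regression_lp are illustrative assumptions, and later pages treat computation in detail.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """Solve min_beta sum_i rho_tau(y_i - x_i' beta) as a linear program.

    Variables: beta (free), u_plus >= 0, u_minus >= 0, subject to
    X beta + u_plus - u_minus = y, minimizing tau * sum(u_plus) + (1 - tau) * sum(u_minus).
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x, size=n)  # heteroscedastic (illustrative)
X = np.column_stack([np.ones(n), x])

print("beta(0.5):", quantile_regression_lp(X, y, 0.5))
print("beta(0.9):", quantile_regression_lp(X, y, 0.9))
```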
Properties of the Estimator:
Existence: A minimizer always exists: the objective is a continuous, convex, piecewise-linear function bounded below by zero, so its infimum is attained.
Uniqueness: When the design matrix has full column rank and the data are in general position, the minimizer is typically unique; because the objective is piecewise linear rather than strictly convex, degenerate configurations can yield a convex set of minimizers instead.
Robustness: The estimator has bounded influence function, providing resistance to outliers.
Equivariance: The estimator is scale equivariant (rescaling $y$ by a positive constant rescales $\hat{\beta}(\tau)$ by the same constant), regression-shift equivariant (adding $X\gamma$ to $y$ shifts $\hat{\beta}(\tau)$ by $\gamma$), and equivariant under nonsingular reparameterizations of the design; in addition, conditional quantiles themselves are equivariant under monotone transformations of the response.
Interpretation of Coefficients:
For the linear quantile regression model:
$$Q_\tau(Y \mid X = x) = x^\top \beta(\tau)$$
The coefficient $\beta_j(\tau)$ represents:
The change in the τ-th conditional quantile of Y for a one-unit increase in $x_j$, holding other variables constant.
This contrasts with OLS, where $\beta_j$ represents the change in the conditional mean.
By estimating β(τ) for multiple values of τ (e.g., 0.1, 0.25, 0.5, 0.75, 0.9), we can see how covariates affect different parts of the response distribution. A covariate might have large effects on upper quantiles but negligible effects on lower quantiles—information completely hidden by OLS.
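A hedged sketch of this idea on synthetic heteroscedastic data (the data-generating process and the use of statsmodels' QuantReg are illustrative assumptions): fitting several values of $\tau$ shows the slope growing across quantiles.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(0, 10, size=n)
# Noise scale grows with x, so x matters more for upper quantiles than lower ones
y = 2.0 + 1.0 * x + rng.normal(scale=0.5 + 0.4 * x, size=n)
X = sm.add_constant(x)

for tau in (0.1, 0.25, 0.5, 0.75, 0.9):
    slope = sm.QuantReg(y, X).fit(q=tau).params[1]
    print(f"tau = {tau:.2f}: slope = {slope:.3f}")
# The slope increases with tau: a pattern a single OLS slope cannot reveal.
```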
We have established the mathematical foundation of quantile regression through the quantile loss function.
What's Next:
In the next page, we'll explore the pinball loss function—an alternative formulation of quantile loss that provides additional insight and computational advantages. We'll see how pinball loss connects to probabilistic forecasting, scoring rules, and modern machine learning frameworks.
You now understand the quantile loss function—the mathematical engine of quantile regression. This asymmetric loss enables estimation of conditional quantiles, providing a richer description of the relationship between features and outcomes than mean regression alone. Next, we'll formalize this through the pinball loss and explore its optimization.