Every regression technique you've encountered thus far—from simple linear regression to sophisticated kernel methods—shares a common objective: estimating the conditional mean. When you predict house prices or stock returns, you're implicitly asking, "What is the average value of y given x?"
But this singular focus on the mean tells only part of the story. In many real-world scenarios, knowing the average is insufficient or even misleading: the distribution may be heavily skewed, the spread of the response may change with the covariates, or the decision at hand may hinge on the tails rather than the center.
Quantile regression provides the mathematical framework to address these questions directly, allowing us to model any quantile of the conditional distribution—not just the mean.
By the end of this page, you will understand the quantile loss function—the foundation of quantile regression. You'll see how this elegantly asymmetric loss function enables estimation of any desired quantile, why it reduces to familiar cases at special values, and how it fundamentally differs from squared error loss.
To appreciate quantile regression, we must first understand what we sacrifice when focusing exclusively on conditional means.
The Standard Regression Setup:
Consider the familiar regression model:
$$y_i = f(x_i) + \varepsilon_i$$
where $y_i$ is the response, $x_i$ are features, $f(x_i)$ is the conditional mean function, and $\varepsilon_i$ is zero-mean noise. Ordinary Least Squares (OLS) estimates $f$ by minimizing:
$$\hat{f}_{\text{OLS}} = \arg\min_f \sum_{i=1}^{n} (y_i - f(x_i))^2$$
This estimator has a well-known property: the squared-error objective is minimized, in the population, by the conditional mean $\mathbb{E}[Y \mid X = x]$.
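To make this concrete, here is a minimal numerical sketch (the exponential sample and grid search are illustrative assumptions, not part of the original example) showing that the constant minimizing the average squared loss over a sample is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5_000)  # a right-skewed sample (illustrative)

# Scan candidate constants c and evaluate the average squared loss for each
candidates = np.linspace(y.min(), y.max(), 10_000)
mse = np.array([np.mean((y - c) ** 2) for c in candidates])

best_c = candidates[np.argmin(mse)]
print(f"argmin of mean squared loss: {best_c:.4f}")
print(f"sample mean:                 {y.mean():.4f}")  # agrees up to grid resolution
```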
Why the mean can be inadequate:
The conditional mean is pulled around by extreme values, it says nothing about the spread or the tails of the response, and in skewed distributions it can sit far from what a typical observation looks like.
A Motivating Example:
Consider predicting medical costs based on patient characteristics. The distribution of healthcare expenditures is notoriously right-skewed: most patients incur modest costs, but a small fraction generates extremely high bills. The conditional mean—influenced heavily by these expensive cases—may not represent what most patients will actually pay.
If an insurer sets premiums based only on the mean, it might overcharge the typical low-cost patient while still underestimating its exposure to the small fraction of extremely expensive cases.
Quantile regression provides a richer picture by estimating the entire conditional distribution.
Quantile regression doesn't replace mean regression—it complements it. When distributional effects vary across different parts of the outcome distribution, or when robustness to outliers is paramount, quantile regression becomes an essential tool.
Before constructing the quantile loss function, let's establish rigorous definitions and properties of quantiles themselves.
Definition (Quantile):
For a random variable $Y$ with cumulative distribution function (CDF) $F_Y(y) = P(Y \leq y)$, the $\tau$-th quantile $q_\tau$ is defined as:
$$q_\tau = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
for $\tau \in (0, 1)$.
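As a quick illustration, here is a minimal sketch (the normal sample is an assumed example) of the empirical analogue of this definition, which picks the smallest observed value whose empirical CDF reaches $\tau$:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(size=1_000))  # illustrative sample
tau = 0.9
n = len(sample)

# The empirical CDF at the i-th sorted value is (i + 1) / n, so
# inf{y : F_n(y) >= tau} is the first sorted value whose rank reaches tau * n.
k = int(np.ceil(tau * n)) - 1   # zero-based index
q_tau_empirical = sample[k]

print(f"empirical 0.9-quantile (inf definition): {q_tau_empirical:.4f}")
# np.quantile interpolates by default, so the two values can differ slightly
print(f"np.quantile for comparison:              {np.quantile(sample, tau):.4f}")
```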
Key Properties of Quantiles:
- Quantiles always exist, even for heavy-tailed distributions whose mean is undefined.
- The quantile function $\tau \mapsto q_\tau$ is non-decreasing in $\tau$.
- Quantiles are equivariant under monotone transformations: for a non-decreasing function $g$, the $\tau$-th quantile of $g(Y)$ is $g(q_\tau)$.
Common Quantiles and Their Names:
| Quantile Level τ | Name | Interpretation |
|---|---|---|
| τ = 0.5 | Median | The middle value; 50% of observations lie below |
| τ = 0.25, 0.75 | Quartiles (Q1, Q3) | Divide data into four equal parts |
| τ = 0.1, 0.9 | Deciles (D1, D9) | Mark 10% and 90% thresholds |
| τ = 0.01, 0.99 | Percentiles (P1, P99) | Extreme quantiles for tail analysis |
| τ = 0.05 | Lower 5th Percentile | Common in risk assessment (VaR) |
Conditional Quantiles:
In regression, we're interested in conditional quantiles—the quantiles of $Y$ given $X = x$:
$$Q_\tau(Y \mid X = x) = \inf\{y : P(Y \leq y \mid X = x) \geq \tau\}$$
For the conditional mean, we have $\mathbb{E}[Y \mid X = x]$. Quantile regression generalizes this to:
$$Q_\tau(Y \mid X = x) = x^\top \beta(\tau)$$
where $\beta(\tau)$ is the coefficient vector for quantile $\tau$. Note that the coefficients themselves depend on $\tau$—different quantiles have different regression coefficients!
Unlike OLS where a single β describes the entire conditional mean, quantile regression produces a family of coefficient vectors β(τ)—one for each quantile τ ∈ (0,1). This reveals how the effect of covariates varies across the distribution of the response.
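As a hedged sketch of what fitting a single conditional quantile can look like in practice, the snippet below uses statsmodels' QuantReg on synthetic heteroscedastic data; the data-generating process is an illustrative assumption, not something specified in the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2_000
x = rng.uniform(0, 10, size=n)
# The spread of y grows with x, so different quantiles have different slopes
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x, size=n)

X = sm.add_constant(x)              # [1, x] design matrix
res = sm.QuantReg(y, X).fit(q=0.9)  # estimate beta(0.9)
print(res.params)                   # intercept and slope of the 0.9 conditional quantile line
```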
The choice of loss function determines what aspect of the conditional distribution we estimate. OLS uses squared loss, which minimizes to the mean. What loss function yields quantiles?
The Key Insight:
To estimate the $\tau$-th quantile, we need a loss function that penalizes underestimation and overestimation asymmetrically, with the relative weights chosen so that the minimizer of the expected loss lands exactly at the $\tau$-th quantile rather than at the mean.
The Check Function (Quantile Loss):
The quantile loss function (also called the check function or tilted absolute value function) for quantile level $\tau \in (0, 1)$ is defined as:
$$\rho_\tau(u) = u \cdot (\tau - \mathbb{1}\{u < 0\})$$
where $\mathbb{1}\{u < 0\}$ is the indicator function that equals 1 if $u < 0$ and 0 otherwise.
Equivalent piecewise formulation:
$$\rho_\tau(u) = \begin{cases} \tau \cdot u & \text{if } u \geq 0 \\ (\tau - 1) \cdot u & \text{if } u < 0 \end{cases}$$
Since $\tau - 1$ is negative and $u < 0$ in the second case, this makes $(\tau - 1) \cdot u$ positive, as required for a loss function.
Understanding the Asymmetry:
Let's analyze how the loss penalizes positive errors (underestimation: $y > \hat{y}$, so $u = y - \hat{y} > 0$) versus negative errors (overestimation: $y < \hat{y}$, so $u < 0$):
For $\tau = 0.9$ (estimating the 90th percentile), a positive residual of size $|u|$ (underestimation) incurs loss $0.9|u|$, while a negative residual of the same size (overestimation) incurs only $0.1|u|$.
The optimizer is pushed to make underestimation errors rare because they are penalized 9× more heavily than overestimation. This drives the estimate toward the 90th percentile—a value exceeded by only 10% of observations.
| τ | Underestimation Penalty (u > 0) | Overestimation Penalty (u < 0) | Effect |
|---|---|---|---|
| 0.10 | 0.10 × \|u\| | 0.90 × \|u\| | Strongly penalizes overestimation; estimates lower quantile |
| 0.25 | 0.25 × \|u\| | 0.75 × \|u\| | Penalizes overestimation more; estimates Q1 |
| 0.50 | 0.50 × \|u\| | 0.50 × \|u\| | Symmetric penalty → estimates median |
| 0.75 | 0.75 × \|u\| | 0.25 × \|u\| | Penalizes underestimation more; estimates Q3 |
| 0.90 | 0.90 × \|u\| | 0.10 × \|u\| | Strongly penalizes underestimation; estimates upper quantile |
When τ = 0.5, both types of errors are penalized equally, and the quantile loss reduces to half the absolute error: ρ₀.₅(u) = 0.5|u|. Minimizing this yields the conditional median—which is why median regression is also known as Least Absolute Deviations (LAD) regression.
The quantile loss function possesses several mathematical properties that make it suitable for optimization and statistical inference.
Property 1: Convexity
The quantile loss $\rho_\tau(u)$ is a convex function of $u$. This is easily verified: the function is piecewise linear with slope $\tau - 1 < 0$ for $u < 0$ and slope $\tau > 0$ for $u \geq 0$, and since the slope never decreases as $u$ increases, the function is convex.
Implication: The quantile regression objective function is convex in the parameters, guaranteeing a unique global minimum (or a convex set of minimizers).
Property 2: Population Minimizer
The $\tau$-th population quantile minimizes the expected quantile loss. Formally:
$$q_\tau = \arg\min_q \mathbb{E}[\rho_\tau(Y - q)]$$
Proof Sketch:
Take the derivative (in the distributional sense) of $\mathbb{E}[\rho_\tau(Y - q)]$ with respect to $q$:
$$\frac{d}{dq} \mathbb{E}[\rho_\tau(Y - q)] = -\tau P(Y > q) + (1 - \tau) P(Y < q)$$
$$= -\tau(1 - F(q)) + (1 - \tau)F(q) = F(q) - \tau$$
Setting this to zero gives $F(q) = \tau$, hence $q = F^{-1}(\tau) = q_\tau$. □
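A small numerical sanity check of this property (the log-normal sample and grid scan are illustrative assumptions): minimizing the empirical average of $\rho_\tau(Y - q)$ over a grid of candidate values $q$ should recover the sample $\tau$-quantile.

```python
import numpy as np

def rho(u, tau):
    """Quantile (check) loss."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

rng = np.random.default_rng(7)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # right-skewed sample (illustrative)

tau = 0.75
candidates = np.linspace(y.min(), y.max(), 5_000)
avg_loss = np.array([rho(y - q, tau).mean() for q in candidates])

q_hat = candidates[np.argmin(avg_loss)]
print(f"argmin of average check loss: {q_hat:.4f}")
print(f"sample 0.75-quantile:         {np.quantile(y, tau):.4f}")  # nearly coincide
```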
Property 3: Subgradient Structure
At $u = 0$, the quantile loss is not differentiable. However, it has a well-defined subgradient:
$$\partial \rho_\tau(u) = \begin{cases} \{\tau\} & \text{if } u > 0 \\ [\tau - 1, \tau] & \text{if } u = 0 \\ \{\tau - 1\} & \text{if } u < 0 \end{cases}$$
The subgradient at zero is the interval $[\tau - 1, \tau]$, which always contains zero when $\tau \in (0, 1)$. This structure is exploited by optimization algorithms.
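The kink can be seen numerically with one-sided difference quotients at zero, which approach the two endpoints of the subgradient interval (a minimal sketch; $\tau = 0.7$ is an arbitrary choice):

```python
def rho(u, tau):
    # scalar check function: u * (tau - 1{u < 0})
    return u * (tau - float(u < 0))

tau = 0.7
h = 1e-8
right_slope = (rho(+h, tau) - rho(0.0, tau)) / h   # approaches tau
left_slope  = (rho(0.0, tau) - rho(-h, tau)) / h   # approaches tau - 1
print(f"right derivative at 0: {right_slope:.6f} (tau = {tau})")
print(f"left derivative at 0:  {left_slope:.6f} (tau - 1 = {tau - 1})")
# Any value in [tau - 1, tau] is a valid subgradient at u = 0.
```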
Property 4: Robustness
Unlike squared loss, which grows quadratically with $|u|$, the quantile loss grows only linearly. This bounds the influence any single observation can exert, providing resistance to outliers in the response and more stable estimates for heavy-tailed data.
The linear growth of quantile loss gives it a bounded influence function—extreme observations cannot arbitrarily distort estimates. This connects quantile regression to the broader field of robust statistics, where resistance to outliers is a primary design goal.
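A tiny demonstration of this robustness contrast (the clean sample and the single injected outlier are illustrative assumptions): the mean is dragged far away by one gross error, while the median, the minimizer of the symmetric $\tau = 0.5$ loss, barely moves.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=10.0, scale=1.0, size=100)

y_contaminated = y.copy()
y_contaminated[0] = 1_000.0   # one gross outlier

print(f"mean:   clean = {y.mean():7.3f}, contaminated = {y_contaminated.mean():7.3f}")
print(f"median: clean = {np.median(y):7.3f}, contaminated = {np.median(y_contaminated):7.3f}")
# The mean jumps far from 10 because of a single point; the median is barely affected.
```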
Understanding the differences between quantile loss and squared loss illuminates when each is appropriate.
The Two Loss Functions Side by Side:
$$L_{\text{squared}}(u) = u^2 \qquad \text{vs.} \qquad \rho_\tau(u) = u(\tau - \mathbb{1}\{u < 0\})$$
Geometric Interpretation:
The squared loss creates a symmetric bowl that pushes the estimate toward the center of the residual distribution. The quantile loss creates an asymmetric 'V' that tilts toward one side, pushing the estimate toward a specific quantile.
When to Choose Each:
Use Squared Loss (OLS) when the conditional mean is the quantity of interest, the errors are roughly symmetric and light-tailed, and closed-form solutions and classical inference are valuable.
Use Quantile Loss when the response is skewed or outlier-prone, when over- and under-prediction carry different costs, or when you care about specific parts of the conditional distribution (the median, the tails, prediction intervals) rather than the average.
Unlike squared loss, quantile loss lacks a continuous derivative at zero. This prevents closed-form solutions and requires specialized optimization algorithms—typically linear programming or subgradient methods. We'll address these computational aspects in later pages.
Visual understanding of the quantile loss function solidifies these concepts.
```python
import numpy as np
import matplotlib.pyplot as plt

def quantile_loss(u, tau):
    """
    Compute the quantile loss (check function) for residual u
    and quantile level tau.

    Parameters:
    -----------
    u : array-like
        Residuals (y - y_hat)
    tau : float
        Quantile level in (0, 1)

    Returns:
    --------
    array-like
        Quantile loss values
    """
    u = np.asarray(u, dtype=float)  # accept scalars as well as arrays
    return u * (tau - (u < 0).astype(float))

# Alternative equivalent formulation
def quantile_loss_explicit(u, tau):
    """Explicit piecewise formulation for clarity."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

# Generate residual values
u = np.linspace(-3, 3, 1000)

# Compute losses for different quantile levels
tau_values = [0.1, 0.25, 0.5, 0.75, 0.9]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Quantile loss for different tau values
ax1 = axes[0]
for tau in tau_values:
    loss = quantile_loss(u, tau)
    ax1.plot(u, loss, label=f'τ = {tau}', linewidth=2)

ax1.set_xlabel('Residual u = y - ŷ', fontsize=12)
ax1.set_ylabel('Loss ρ_τ(u)', fontsize=12)
ax1.set_title('Quantile Loss Functions for Different τ', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.axhline(y=0, color='k', linewidth=0.5)
ax1.axvline(x=0, color='k', linewidth=0.5)

# Plot 2: Compare squared loss vs quantile loss (τ = 0.5)
ax2 = axes[1]
squared_loss = u ** 2
median_loss = quantile_loss(u, 0.5)
absolute_loss = np.abs(u)

ax2.plot(u, squared_loss, label='Squared Loss (L₂)', linewidth=2)
ax2.plot(u, absolute_loss, label='Absolute Loss (L₁)', linewidth=2, linestyle='--')
ax2.plot(u, 2 * median_loss, label='Quantile Loss (τ=0.5) × 2', linewidth=2, linestyle=':')

ax2.set_xlabel('Residual u', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('Squared vs Absolute (Quantile τ=0.5) Loss', fontsize=14)
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 5)

plt.tight_layout()
plt.show()

# Demonstrate asymmetry numerically
print("\nAsymmetric Penalties for τ = 0.9:")
print("-" * 50)
for residual in [-2, -1, -0.5, 0.5, 1, 2]:
    loss = float(quantile_loss(residual, 0.9))
    error_type = "Overestimate" if residual < 0 else "Underestimate"
    print(f"Residual u = {residual:+.1f}: Loss = {loss:.2f} ({error_type})")
```

Key Observations from the Visualization:
Asymmetry increases as τ departs from 0.5: For τ = 0.9, the left branch is nearly flat while the right branch is steep.
All curves pass through the origin: Zero residual incurs zero loss, regardless of τ.
Median loss equals half the absolute error: The τ = 0.5 curve matches |u|/2.
Mirror symmetry: ρ_τ(u) and ρ_{1-τ}(-u) are identical—estimating the τ-quantile with one orientation is equivalent to estimating the (1-τ)-quantile with reversed sign.
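The mirror symmetry in the last observation is easy to verify numerically (a minimal sketch; the helper is re-declared so the snippet stands alone):

```python
import numpy as np

def quantile_loss(u, tau):
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

u = np.linspace(-3, 3, 7)
tau = 0.8
# rho_tau(u) equals rho_{1-tau}(-u) for every residual u
print(np.allclose(quantile_loss(u, tau), quantile_loss(-u, 1 - tau)))  # True
```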
With the quantile loss function established, we can now formulate the quantile regression optimization problem.
Problem Statement:
Given data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, the $\tau$-quantile regression estimator is:
$$\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \beta)$$
Expanding the loss function:
$$\hat{\beta}(\tau) = \arg\min_{\beta} \left[ \tau \sum_{i: y_i \geq x_i^\top \beta} |y_i - x_i^\top \beta| + (1 - \tau) \sum_{i: y_i < x_i^\top \beta} |y_i - x_i^\top \beta| \right]$$
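One way to see that this objective is computationally tractable is to recast it as a linear program by splitting each residual into its positive and negative parts. The sketch below does this with scipy.optimize.linprog; the formulation is the standard one alluded to earlier, but the synthetic data and the helper name quantile_regression_lp are illustrative assumptions, and later pages treat computation in detail.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """Solve min_beta sum_i rho_tau(y_i - x_i' beta) as a linear program.

    Variables: beta (free), u_plus >= 0, u_minus >= 0, subject to
    X beta + u_plus - u_minus = y, minimizing tau * sum(u_plus) + (1 - tau) * sum(u_minus).
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + 0.3 * x, size=n)  # heteroscedastic (illustrative)
X = np.column_stack([np.ones(n), x])

print("beta(0.5):", quantile_regression_lp(X, y, 0.5))
print("beta(0.9):", quantile_regression_lp(X, y, 0.9))
```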
Properties of the Estimator:
Existence: A minimizer always exists: the objective is a continuous, convex, piecewise-linear function bounded below by zero, so its infimum is attained.
Uniqueness: When the design matrix has full column rank and the data are in general position, the minimizer is typically unique; because the objective is piecewise linear rather than strictly convex, degenerate configurations can yield a convex set of minimizers instead.
Robustness: The estimator has bounded influence function, providing resistance to outliers.
Equivariance: The estimator is scale equivariant (rescaling $y$ by a positive constant rescales $\hat{\beta}(\tau)$ by the same constant), regression-shift equivariant (adding $X\gamma$ to $y$ shifts $\hat{\beta}(\tau)$ by $\gamma$), and equivariant under nonsingular reparameterizations of the design; in addition, conditional quantiles themselves are equivariant under monotone transformations of the response.
Interpretation of Coefficients:
For the linear quantile regression model:
$$Q_\tau(Y \mid X = x) = x^\top \beta(\tau)$$
The coefficient $\beta_j(\tau)$ represents:
The change in the τ-th conditional quantile of Y for a one-unit increase in $x_j$, holding other variables constant.
This contrasts with OLS, where $\beta_j$ represents the change in the conditional mean.
By estimating β(τ) for multiple values of τ (e.g., 0.1, 0.25, 0.5, 0.75, 0.9), we can see how covariates affect different parts of the response distribution. A covariate might have large effects on upper quantiles but negligible effects on lower quantiles—information completely hidden by OLS.
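A hedged sketch of this idea on synthetic heteroscedastic data (the data-generating process and the use of statsmodels' QuantReg are illustrative assumptions): fitting several values of $\tau$ shows the slope growing across quantiles.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(0, 10, size=n)
# Noise scale grows with x, so x matters more for upper quantiles than lower ones
y = 2.0 + 1.0 * x + rng.normal(scale=0.5 + 0.4 * x, size=n)
X = sm.add_constant(x)

for tau in (0.1, 0.25, 0.5, 0.75, 0.9):
    slope = sm.QuantReg(y, X).fit(q=tau).params[1]
    print(f"tau = {tau:.2f}: slope = {slope:.3f}")
# The slope increases with tau: a pattern a single OLS slope cannot reveal.
```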
We have established the mathematical foundation of quantile regression through the quantile loss function.
What's Next:
In the next page, we'll explore the pinball loss function—an alternative formulation of quantile loss that provides additional insight and computational advantages. We'll see how pinball loss connects to probabilistic forecasting, scoring rules, and modern machine learning frameworks.
You now understand the quantile loss function—the mathematical engine of quantile regression. This asymmetric loss enables estimation of conditional quantiles, providing a richer description of the relationship between features and outcomes than mean regression alone. Next, we'll formalize this through the pinball loss and explore its optimization.