In the previous page, we introduced the check function $\rho_\tau(u)$ as the loss function underlying quantile regression. Now we'll explore an equivalent formulation—the pinball loss—which offers additional insights and proves invaluable for practical implementation.
The term "pinball loss" derives from the graph of the function: its V-shape (asymmetric at $\tau \neq 0.5$) resembles the trajectory of a pinball bouncing off angled flippers. This vivid metaphor captures the essence of asymmetric penalization.
Why another formulation?
While mathematically equivalent to the check function, the pinball loss perspective emphasizes the loss as a function of the residual, connects directly to forecast evaluation via proper scoring rules, and maps cleanly onto gradient-based training in modern machine learning systems.
By the end of this page, you will master the pinball loss formulation, understand its role as a proper scoring rule for quantile forecasts, see how it enables gradient computation, and appreciate its applications in modern machine learning systems.
Let's establish the formal definition and demonstrate its equivalence to the check function.
Definition (Pinball Loss):
For a quantile level $\tau \in (0, 1)$, the pinball loss between the true value $y$ and the predicted quantile $\hat{q}$ is:
$$L_{\tau}(y, \hat{q}) = \begin{cases} \tau (y - \hat{q}) & \text{if } y \geq \hat{q} \\ (1 - \tau)(\hat{q} - y) & \text{if } y < \hat{q} \end{cases}$$
Compact Notation:
Using the residual $e = y - \hat{q}$:
$$L_{\tau}(y, \hat{q}) = \max\{\tau \cdot e,\; (\tau - 1) \cdot e\}$$
Or equivalently using indicator functions:
$$L_{\tau}(y, \hat{q}) = (\tau - \mathbb{1}\{y < \hat{q}\})(y - \hat{q})$$
Proof of Equivalence to Check Function:
Recall the check function: $$\rho_\tau(u) = u(\tau - \mathbb{1}\{u < 0\})$$
With $u = y - \hat{q}$: if $y \geq \hat{q}$, then $u \geq 0$ and $\rho_\tau(u) = \tau u = \tau(y - \hat{q})$; if $y < \hat{q}$, then $u < 0$ and $\rho_\tau(u) = (\tau - 1)u = (1 - \tau)(\hat{q} - y)$.
This matches the pinball loss definition exactly. The two formulations are algebraically identical. □
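The identity is purely algebraic, but a quick numerical spot check is reassuring. The sketch below (illustrative only) evaluates the piecewise, max-based, and indicator-based forms on the same values used in the table further down and confirms they coincide.

```python
import numpy as np

def pinball_piecewise(y, q, tau):
    # Piecewise form: tau * (y - q) if y >= q, else (1 - tau) * (q - y)
    return np.where(y >= q, tau * (y - q), (1 - tau) * (q - y))

def pinball_max(y, q, tau):
    # Max form: max(tau * e, (tau - 1) * e) with e = y - q
    e = y - q
    return np.maximum(tau * e, (tau - 1) * e)

def pinball_indicator(y, q, tau):
    # Indicator form: (tau - 1{y < q}) * (y - q)
    return (tau - (y < q).astype(float)) * (y - q)

y = np.array([10.0, 10.0, 10.0])
q = np.array([8.0, 10.0, 12.0])
for tau in [0.1, 0.5, 0.9]:
    vals = [f(y, q, tau) for f in (pinball_piecewise, pinball_max, pinball_indicator)]
    assert np.allclose(vals[0], vals[1]) and np.allclose(vals[0], vals[2])
    print(f"tau = {tau}: loss = {vals[0]}")
```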
Why "Pinball"?
The name comes from visualizing the loss as a function of the residual $e = y - \hat{q}$:
The asymmetric V-shape, with the vertex at $e = 0$, resembles the trajectory of a ball bouncing off tilted surfaces.
| Scenario | τ = 0.1 | τ = 0.5 | τ = 0.9 |
|---|---|---|---|
| y = 10, q̂ = 8 (underest.) | 0.1 × 2 = 0.2 | 0.5 × 2 = 1.0 | 0.9 × 2 = 1.8 |
| y = 10, q̂ = 10 (exact) | 0 | 0 | 0 |
| y = 10, q̂ = 12 (overest.) | 0.9 × 2 = 1.8 | 0.5 × 2 = 1.0 | 0.1 × 2 = 0.2 |
For τ = 0.9, underestimation is penalized 9× more than overestimation. This makes sense: a good 90th percentile estimate should rarely be exceeded—only about 10% of observations should fall above it.
The pinball loss belongs to a special class of evaluation metrics called proper scoring rules—a concept from decision theory that provides theoretical justification for using pinball loss.
Definition (Scoring Rule):
A scoring rule $S(F, y)$ assigns a numerical score to a probabilistic forecast $F$ (here, a predicted quantile or predictive distribution) once the outcome $y$ is observed.
Definition (Proper Scoring Rule):
A scoring rule is proper if the true distribution minimizes the expected score. Formally, for a random variable $Y \sim G$:
$$\mathbb{E}_G[S(G, Y)] \leq \mathbb{E}_G[S(F, Y)] \quad \text{for all } F$$
with equality if and only if $F = G$ (for strictly proper rules).
Why Properness Matters:
Proper scoring rules incentivize honest forecasts. A forecaster minimizing expected score is driven to report their true belief. Improper rules can incentivize strategic distortion.
Theorem (Pinball Loss is Proper for Quantiles):
For any distribution $G$ and quantile level $\tau$, the pinball loss is minimized in expectation when $\hat{q} = G^{-1}(\tau)$, the true $\tau$-quantile of $G$.
Proof:
Let $Y \sim G$. We seek to minimize:
$$\mathbb{E}_G[L_\tau(Y, \hat{q})] = \tau \int_{\hat{q}}^{\infty} (y - \hat{q}) \, dG(y) + (1 - \tau) \int_{-\infty}^{\hat{q}} (\hat{q} - y) \, dG(y)$$
Differentiating with respect to $\hat{q}$:
$$\frac{d}{d\hat{q}} \mathbb{E}[L_\tau(Y, \hat{q})] = -\tau(1 - G(\hat{q})) + (1 - \tau)G(\hat{q}) = G(\hat{q}) - \tau$$
Setting to zero: $G(\hat{q}) = \tau$, hence $\hat{q} = G^{-1}(\tau)$. □
Implications:
If you minimize pinball loss over a large representative sample, the resulting quantile predictions will be well-calibrated: approximately τ fraction of observations will fall below your τ-quantile predictions. This is a powerful guarantee that squared loss cannot provide.
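A small Monte Carlo sketch (illustrative; the Gamma sample is an arbitrary choice) makes the theorem concrete: scanning candidate predictions, the average pinball loss is minimized near the empirical $\tau$-quantile.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.8
# A skewed sample (Gamma parameters chosen arbitrarily for illustration)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def mean_pinball(y, q, tau):
    e = y - q
    return np.mean(np.maximum(tau * e, (tau - 1) * e))

# Scan candidate values of q-hat; the minimizer should sit at the empirical tau-quantile
candidates = np.linspace(0.0, 25.0, 501)
losses = [mean_pinball(y, q, tau) for q in candidates]
q_star = candidates[int(np.argmin(losses))]

print(f"Minimizer of mean pinball loss: {q_star:.3f}")
print(f"Empirical {tau}-quantile:       {np.quantile(y, tau):.3f}")
```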
For gradient-based optimization (essential for neural networks and large-scale problems), we need to compute derivatives of the pinball loss.
The Challenge: Non-Differentiability
The pinball loss $L_\tau(y, \hat{q})$ has a kink at $y = \hat{q}$, making it non-differentiable at this point. However, this is a set of measure zero, and we can use subgradients.
Subgradient of Pinball Loss:
The subgradient with respect to $\hat{q}$ is:
$$\partial_{\hat{q}} L_\tau(y, \hat{q}) = \begin{cases} -\tau & \text{if } y > \hat{q} \\ [-\tau,\, 1 - \tau] & \text{if } y = \hat{q} \\ 1 - \tau & \text{if } y < \hat{q} \end{cases}$$
Practical Gradient:
For implementation, we typically use:
$$\frac{\partial L_\tau}{\partial \hat{q}} = \mathbb{1}\{y < \hat{q}\} - \tau$$
This equals $-\tau$ when $y > \hat{q}$ and $1 - \tau$ when $y < \hat{q}$; at the kink $y = \hat{q}$ it returns $-\tau$, which is a valid element of the subgradient set above.
Intuition: If the true value exceeds our prediction (underestimate), the gradient is negative, pushing $\hat{q}$ upward. If we overestimate, the gradient is positive, pushing $\hat{q}$ downward. The asymmetry determines how strongly.
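As a standalone sanity check (separate from the implementation that follows), PyTorch's autograd reproduces exactly this expression away from the kink:

```python
import torch

tau = 0.7
y_true = torch.tensor([10.0, 12.0, 8.0, 15.0])
q_hat = torch.tensor([9.0, 13.0, 8.5, 12.0], requires_grad=True)

# Pinball loss, summed so each element's gradient is not scaled by the batch size
residual = y_true - q_hat
loss = torch.where(residual >= 0, tau * residual, (tau - 1) * residual).sum()
loss.backward()

# Analytical (sub)gradient: 1{y < q_hat} - tau (valid wherever y != q_hat)
analytic = (y_true < q_hat).float() - tau
print("autograd :", q_hat.grad)
print("analytic :", analytic)
```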
```python
import numpy as np
import torch
import torch.nn as nn


class PinballLoss(nn.Module):
    """
    PyTorch implementation of Pinball Loss for quantile regression.

    This is the standard loss function for training neural networks
    to predict specific quantiles of the response distribution.
    """

    def __init__(self, tau: float):
        """
        Initialize pinball loss for quantile level tau.

        Parameters:
        -----------
        tau : float
            Quantile level in (0, 1). E.g., 0.5 for median, 0.9 for 90th percentile.
        """
        super().__init__()
        if not 0 < tau < 1:
            raise ValueError(f"tau must be in (0, 1), got {tau}")
        self.tau = tau

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        """
        Compute pinball loss.

        Parameters:
        -----------
        y_pred : torch.Tensor
            Predicted quantile values, shape (batch_size,) or (batch_size, 1)
        y_true : torch.Tensor
            True target values, shape (batch_size,) or (batch_size, 1)

        Returns:
        --------
        torch.Tensor
            Scalar mean pinball loss
        """
        residual = y_true - y_pred
        loss = torch.where(
            residual >= 0,
            self.tau * residual,
            (self.tau - 1) * residual  # (tau - 1) * residual = (1 - tau) * |residual|
        )
        return loss.mean()


class MultiQuantileLoss(nn.Module):
    """
    Pinball loss for predicting multiple quantiles simultaneously.

    Enables a single network to output predictions for multiple quantile
    levels, which is useful for estimating prediction intervals.
    """

    def __init__(self, quantiles: list):
        """
        Parameters:
        -----------
        quantiles : list of float
            List of quantile levels, e.g., [0.1, 0.5, 0.9]
        """
        super().__init__()
        self.quantiles = torch.tensor(quantiles)

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        """
        Compute average pinball loss across all quantiles.

        Parameters:
        -----------
        y_pred : torch.Tensor
            Predicted quantiles, shape (batch_size, num_quantiles)
        y_true : torch.Tensor
            True values, shape (batch_size,) or (batch_size, 1)
        """
        if y_true.dim() == 1:
            y_true = y_true.unsqueeze(1)

        # y_true: (batch, 1), y_pred: (batch, num_quantiles)
        residuals = y_true - y_pred  # (batch, num_quantiles)

        # tau values for each quantile: (num_quantiles,)
        tau = self.quantiles.to(y_pred.device)

        losses = torch.where(
            residuals >= 0,
            tau * residuals,
            (tau - 1) * residuals
        )
        return losses.mean()


# Example usage
if __name__ == "__main__":
    # Single quantile prediction
    y_true = torch.tensor([10.0, 12.0, 8.0, 15.0])
    y_pred = torch.tensor([9.0, 13.0, 8.5, 12.0])

    for tau in [0.1, 0.5, 0.9]:
        loss_fn = PinballLoss(tau)
        loss = loss_fn(y_pred, y_true)
        print(f"τ = {tau}: Pinball Loss = {loss.item():.4f}")

    print("\n--- Multi-Quantile Prediction ---")

    # Multi-quantile: predict 10th, 50th, 90th percentiles
    quantiles = [0.1, 0.5, 0.9]
    y_true_multi = torch.tensor([10.0, 12.0, 8.0, 15.0])
    y_pred_multi = torch.tensor([
        [8.0, 10.0, 14.0],   # predictions for observation 1
        [10.0, 12.0, 16.0],  # predictions for observation 2
        [6.0, 8.0, 12.0],    # predictions for observation 3
        [12.0, 14.0, 18.0],  # predictions for observation 4
    ])

    multi_loss_fn = MultiQuantileLoss(quantiles)
    multi_loss = multi_loss_fn(y_pred_multi, y_true_multi)
    print(f"Multi-Quantile Loss: {multi_loss.item():.4f}")
```

The PyTorch implementation enables quantile regression with neural networks. Simply replace `MSELoss` with `PinballLoss` to predict any desired quantile. For uncertainty estimation, use `MultiQuantileLoss` to predict multiple quantiles (e.g., 10th, 50th, 90th) in a single forward pass.
The non-differentiability of pinball loss at $y = \hat{q}$ can cause issues for some optimization algorithms (e.g., Newton's method, L-BFGS). Several smooth approximations exist:
1. Huber-Style Smoothing:
Create a quadratic region near zero:
$$L_\tau^{\delta}(e) = \begin{cases} \rho_\tau(e) & \text{if } |e| > \delta \\ \frac{e^2}{2\delta} + (2\tau - 1)\frac{e}{2} + \frac{\delta}{4}(2\tau - 1)^2 & \text{if } |e| \leq \delta \end{cases}$$
where $e = y - \hat{q}$ and $\delta > 0$ is a smoothing parameter.
2. Log-Sum-Exp Approximation:
Using the smooth maximum function:
$$L_\tau^{\text{smooth}}(e) = \alpha \log\!\left( \exp\!\left(\frac{\tau e}{\alpha}\right) + \exp\!\left(\frac{(\tau - 1) e}{\alpha}\right) \right) \approx L_\tau(e)$$
where $\alpha > 0$ controls smoothness (smaller $\alpha$ → closer to true pinball).
3. Rectified Linear Combination:
Express pinball as: $$L_\tau(e) = \tau \max(e, 0) + (1 - \tau) \max(-e, 0)$$
Then smooth each ReLU using softplus: $\text{softplus}(x) = \log(1 + e^x)$.
```python
import numpy as np
import matplotlib.pyplot as plt


def pinball_loss(e, tau):
    """Standard pinball loss."""
    return np.where(e >= 0, tau * e, (tau - 1) * e)


def smooth_pinball_logsumexp(e, tau, alpha=0.1):
    """
    Smooth approximation using log-sum-exp.

    alpha -> 0 recovers exact pinball loss
    alpha -> inf gives linear average of the two branches
    """
    term1 = tau * e / alpha
    term2 = (tau - 1) * e / alpha
    # Use logsumexp trick for numerical stability
    max_term = np.maximum(term1, term2)
    return alpha * (max_term + np.log(np.exp(term1 - max_term) + np.exp(term2 - max_term)))


def smooth_pinball_softplus(e, tau, beta=10):
    """
    Smooth approximation using softplus.

    Softplus(x) = log(1 + exp(x)) ≈ max(0, x)
    beta controls sharpness: higher beta -> closer to exact
    """
    softplus = lambda x: np.log(1 + np.exp(np.clip(beta * x, -50, 50))) / beta
    return tau * softplus(e) + (1 - tau) * softplus(-e)


def smooth_pinball_huber(e, tau, delta=0.1):
    """
    Huber-style smooth pinball loss.

    Quadratic near zero, linear outside [-delta, delta].
    """
    abs_e = np.abs(e)
    sign_factor = np.where(e >= 0, tau, tau - 1)
    linear_part = sign_factor * e
    quadratic_part = (e**2 / (2 * delta)) + (2*tau - 1) * (e / 2) + (delta / 4) * (2*tau - 1)**2
    return np.where(abs_e > delta, linear_part, quadratic_part)


# Visualization
e = np.linspace(-2, 2, 500)
tau = 0.75

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: LogSumExp smoothing
ax1 = axes[0]
ax1.plot(e, pinball_loss(e, tau), 'k-', linewidth=2, label='Exact Pinball')
for alpha in [0.5, 0.2, 0.05]:
    ax1.plot(e, smooth_pinball_logsumexp(e, tau, alpha), '--',
             label=f'α = {alpha}', linewidth=1.5)
ax1.set_xlabel('Residual e')
ax1.set_ylabel('Loss')
ax1.set_title('Log-Sum-Exp Smoothing')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Softplus smoothing
ax2 = axes[1]
ax2.plot(e, pinball_loss(e, tau), 'k-', linewidth=2, label='Exact Pinball')
for beta in [2, 5, 20]:
    ax2.plot(e, smooth_pinball_softplus(e, tau, beta), '--',
             label=f'β = {beta}', linewidth=1.5)
ax2.set_xlabel('Residual e')
ax2.set_ylabel('Loss')
ax2.set_title('Softplus Smoothing')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Huber smoothing
ax3 = axes[2]
ax3.plot(e, pinball_loss(e, tau), 'k-', linewidth=2, label='Exact Pinball')
for delta in [0.5, 0.2, 0.05]:
    ax3.plot(e, smooth_pinball_huber(e, tau, delta), '--',
             label=f'δ = {delta}', linewidth=1.5)
ax3.set_xlabel('Residual e')
ax3.set_ylabel('Loss')
ax3.set_title('Huber-Style Smoothing')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Smooth approximations are useful when:

- Using second-order optimizers (Newton, L-BFGS)
- Requiring continuous gradients for theoretical analysis
- Dealing with numerical instabilities
For first-order methods (SGD, Adam), the exact pinball loss works fine with subgradients.
Pinball loss plays a central role in probabilistic forecasting—the practice of predicting entire distributions rather than point estimates.
The Quantile Forecast Framework:
In many forecasting applications (energy demand, weather, finance), users need not just a single prediction but an understanding of uncertainty. Quantile forecasting provides this by predicting multiple quantiles:
$$\{\hat{q}_{\tau_1}, \hat{q}_{\tau_2}, \ldots, \hat{q}_{\tau_K}\}$$
For example, $\tau \in \{0.1, 0.25, 0.5, 0.75, 0.9\}$ provides five points on the predictive CDF.
Pinball Loss for Forecast Evaluation:
The aggregate pinball loss across quantiles serves as a comprehensive measure of forecast quality:
$$\text{CRPS}_{\text{quantile}} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} L_{\tau_k}(y_i, \hat{q}_{\tau_k, i})$$
This approximates the Continuous Ranked Probability Score (CRPS)—the gold standard for evaluating probabilistic forecasts.
CRPS and Its Relationship to Pinball Loss:
The CRPS for a predictive CDF $F$ and observation $y$ is:
$$\text{CRPS}(F, y) = \int_0^1 2 \, L_\tau(y, F^{-1}(\tau)) \, d\tau$$
This beautiful result shows that CRPS is the integrated pinball loss over all quantile levels. Minimizing average pinball loss across many quantiles approximates CRPS minimization.
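To see the relationship numerically, here is an illustrative sketch (the standard normal predictive distribution and the observed value are arbitrary choices) that approximates the integral on a grid of quantile levels and compares it with the sample-based CRPS identity $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$; the two estimates should roughly agree.

```python
import numpy as np
from scipy.stats import norm

y_obs = 1.3              # observed value (arbitrary)
mu, sigma = 0.0, 1.0     # predictive distribution: N(0, 1), arbitrary example

def pinball(y, q, tau):
    e = y - q
    return np.maximum(tau * e, (tau - 1) * e)

# (1) Quantile-grid approximation: average of 2 * L_tau(y, F^{-1}(tau)) over tau in (0, 1)
taus = np.linspace(0.005, 0.995, 199)
crps_grid = np.mean([2 * pinball(y_obs, norm.ppf(t, mu, sigma), t) for t in taus])

# (2) Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| with X, X' drawn independently from F
rng = np.random.default_rng(0)
x1 = rng.normal(mu, sigma, 200_000)
x2 = rng.normal(mu, sigma, 200_000)
crps_sample = np.mean(np.abs(x1 - y_obs)) - 0.5 * np.mean(np.abs(x1 - x2))

print(f"CRPS via pinball over quantile grid: {crps_grid:.4f}")
print(f"CRPS via sample formula:             {crps_sample:.4f}")
```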
Applications in Industry:
| Domain | Use Case | Key Quantiles |
|---|---|---|
| Energy | Load forecasting for grid management | τ = 0.1, 0.5, 0.9 for demand uncertainty |
| Finance | Value at Risk (VaR) estimation | τ = 0.01, 0.05 for tail risk |
| Retail | Inventory optimization | τ = 0.7-0.95 for safety stock |
| Weather | Temperature and precipitation forecasting | Full quantile spectrum |
| Healthcare | Patient outcome prediction | Lower quantiles for worst-case planning |
| Supply Chain | Lead time prediction | Upper quantiles for buffer sizing |
A good probabilistic forecast isn't just about low pinball loss—it must also be calibrated. If you predict the 90th percentile, about 90% of observations should fall below it. Pinball loss encourages calibration, but you should also verify it empirically using calibration plots.
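A minimal sketch of such a calibration (reliability) plot, using a toy forecaster that is calibrated by construction (it predicts the true N(0, 1) quantiles for N(0, 1) data); in practice you would substitute your model's per-observation quantile predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 5000)

# Perfectly specified forecaster: the predicted tau-quantile is the true N(0, 1) quantile
taus = [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]
predictions = {t: np.full_like(y, norm.ppf(t)) for t in taus}

# Empirical coverage: fraction of observations falling below each predicted quantile
coverage = [np.mean(y < predictions[t]) for t in taus]

plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(taus, coverage, 'o-', label='Empirical coverage')
plt.xlabel('Nominal quantile level τ')
plt.ylabel('Fraction of observations below prediction')
plt.title('Calibration (reliability) plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```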
Pinball loss connects to several other important loss functions in machine learning.
1. L1 Loss (Median):
For $\tau = 0.5$: $$L_{0.5}(y, \hat{q}) = 0.5|y - \hat{q}| = \frac{1}{2} L_1(y, \hat{q})$$
Median regression using pinball loss is equivalent (up to scaling) to Least Absolute Deviations (LAD).
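A two-line check of the scaling claim, with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 7.5, 1.2, 9.8, 4.4])
y_pred = np.array([2.5, 8.0, 2.0, 9.0, 4.4])

e = y_true - y_pred
pinball_05 = np.mean(np.maximum(0.5 * e, -0.5 * e))   # mean pinball loss at tau = 0.5
half_mae = 0.5 * np.mean(np.abs(e))                   # half the mean absolute error

print(pinball_05, half_mae)   # identical values
```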
2. Huber Loss:
Huber loss combines L1 and L2: $$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & |e| \leq \delta \\ \delta\left(|e| - \frac{\delta}{2}\right) & |e| > \delta \end{cases}$$
Pinball loss can be made "Huber-like" by adding smoothing near zero while preserving asymmetry.
3. Hinge Loss (SVM):
The SVM hinge loss has a similar piecewise linear structure: $$L_{\text{hinge}}(y, f) = \max(0, 1 - yf)$$
Both are convex and piecewise linear, non-differentiable at a single kink, and that kink characterizes the solution: the quantile condition $G(\hat{q}) = \tau$ for pinball loss, the margin boundary for the hinge.
4. Expectile Loss:
A less common alternative for asymmetric regression: $$L_{\tau}^{\text{expectile}}(e) = |\tau - \mathbb{1}\{e < 0\}| \cdot e^2$$
Expectile loss is smooth (differentiable) but estimates expectiles, not quantiles.
| Property | Pinball (Quantile) | Expectile | Asymmetric Squared |
|---|---|---|---|
| Estimand | τ-quantile | τ-expectile | Weighted mean |
| Growth | Linear in \|e\| | Quadratic in \|e\| | Quadratic in \|e\| |
| Smoothness | Non-smooth at e=0 | Smooth everywhere | Smooth everywhere |
| Robustness | High (bounded influence) | Low | Low |
| Optimization | Linear programming | Standard gradient | Standard gradient |
| Interpretation | Clear probability | Less intuitive | Depends on weights |
Expectiles are the least-squares analog of quantiles. While quantiles divide probability mass (τ fraction below), expectiles are defined by the asymmetrically weighted mean condition. Expectiles are always interior to the data range and more sensitive to outliers, but enjoy smoothness advantages.
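To make the asymmetrically weighted mean condition concrete, here is a small sketch (using the standard fixed-point characterization of expectiles; the lognormal sample is an arbitrary choice) that computes a τ-expectile by iteratively reweighting the mean and compares it to the τ-quantile of the same data.

```python
import numpy as np

def expectile(y, tau, n_iter=200):
    """Compute the tau-expectile via its weighted-mean fixed point:
    m = sum(w * y) / sum(w), with w = tau where y > m and (1 - tau) otherwise."""
    m = float(np.mean(y))
    for _ in range(n_iter):
        w = np.where(y > m, tau, 1 - tau)
        m = float(np.sum(w * y) / np.sum(w))
    return m

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=0.75, size=50_000)   # right-skewed sample

tau = 0.9
print(f"{tau}-quantile : {np.quantile(y, tau):.3f}")
print(f"{tau}-expectile: {expectile(y, tau):.3f}")
```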
Successful use of pinball loss requires attention to several practical details.
```python
import numpy as np
from sklearn.linear_model import QuantileRegressor
import matplotlib.pyplot as plt

# Generate synthetic heteroscedastic data
np.random.seed(42)
n = 500
X = np.random.uniform(0, 10, n)
# Variance increases with X (heteroscedasticity)
noise = np.random.normal(0, 1 + 0.5 * X, n)
y = 2 * X + 5 + noise

X = X.reshape(-1, 1)

# Fit quantile regression models for different tau
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
models = {}
predictions = {}

for tau in quantiles:
    model = QuantileRegressor(quantile=tau, alpha=0, solver='highs')
    model.fit(X, y)
    models[tau] = model
    predictions[tau] = model.predict(X)

# Calibration check
print("Calibration Check:")
print("-" * 40)
for tau in quantiles:
    fraction_below = np.mean(y < predictions[tau])
    print(f"τ = {tau}: {fraction_below:.3f} (expected: {tau})")

# Visualization
X_sort_idx = np.argsort(X.ravel())
X_sorted = X.ravel()[X_sort_idx]

plt.figure(figsize=(12, 6))
plt.scatter(X, y, alpha=0.3, s=20, label='Data')

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(quantiles)))
for tau, color in zip(quantiles, colors):
    pred_sorted = predictions[tau][X_sort_idx]
    plt.plot(X_sorted, pred_sorted, color=color, linewidth=2, label=f'τ = {tau}')

plt.xlabel('X')
plt.ylabel('y')
plt.title('Quantile Regression with Heteroscedastic Data')
plt.legend(loc='upper left')
plt.grid(True, alpha=0.3)
plt.show()

print("\nNote: Fan-shaped quantile predictions correctly capture")
print("the increasing variance as X increases.")
```

When fitting separate models for each quantile, predictions may 'cross'—e.g., the 90th percentile prediction falling below the 10th. This violates quantile monotonicity. Solutions include: (1) joint quantile estimation, (2) post-hoc sorting, or (3) architectures that enforce monotonicity by construction.
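As a minimal illustration of option (2), post-hoc sorting (often called rearrangement), the sketch below sorts each observation's predicted quantiles so they are non-decreasing in τ; the toy numbers are hypothetical.

```python
import numpy as np

# Predicted quantiles for tau = 0.1, 0.5, 0.9 (rows = observations, columns = quantile levels)
preds = np.array([
    [ 8.0, 10.0, 14.0],
    [11.0, 12.5, 12.0],   # crossed: the 0.9 prediction is below the 0.5 prediction
    [ 6.0,  8.0, 12.0],
])

# Sorting along the quantile axis restores monotonicity in tau for every observation
preds_sorted = np.sort(preds, axis=1)
print(preds_sorted)
```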
We have explored the pinball loss function from multiple perspectives—computational, theoretical, and practical.
What's Next:
In the next page, we'll explore conditional quantiles in depth—how quantile regression estimates the entire conditional distribution $Q_\tau(Y | X)$, interprets coefficients, and handles different data scenarios. We'll see how the same covariates can have dramatically different effects at different quantiles.
You now understand the pinball loss function as both a computational tool and a proper scoring rule. This foundation enables principled quantile forecasting across domains—from energy to finance to healthcare. Next, we'll see how pinball loss reveals the full conditional distribution through estimation of conditional quantiles.