Squared loss is elegant, well-behaved, and computationally convenient. But it has an Achilles' heel: extreme sensitivity to outliers. A single data point with a prediction error of 100 contributes as much to the loss as 10,000 points with errors of 1. In real-world datasets—riddled with measurement errors, data entry mistakes, and genuine anomalies—this sensitivity can be catastrophic.
Enter the absolute loss (also called L1 loss, LAD loss for Least Absolute Deviation, or Mean Absolute Error when averaged). Instead of squaring the error, we simply take its absolute value:
$$L(y, \hat{y}) = |y - \hat{y}|$$
This seemingly minor change has profound consequences. The absolute loss grows linearly with error magnitude—not quadratically. An outlier with error 100 contributes 100 to the loss, not 10,000. This robustness comes at a price: the loss function is not smooth, introducing mathematical and computational challenges that require careful handling.
By the end of this page, you will understand absolute loss in depth—its definition, gradient (and subgradient at zero), connection to the conditional median, robustness properties, and implementation considerations in gradient boosting. You'll learn when absolute loss is preferable to squared loss and how to handle its non-differentiability.
The absolute loss function measures the absolute difference between predicted and actual values:
$$L(y, \hat{y}) = |y - \hat{y}|$$
For a dataset of $n$ observations, the total loss is:
$$\mathcal{L}(F) = \sum_{i=1}^{n} |y_i - F(x_i)|$$
Key Properties:
1. Non-negativity: $L(y, \hat{y}) \geq 0$, with equality only when $\hat{y} = y$.
2. Symmetry: Like squared loss, the penalty is symmetric: $L(y, y+\epsilon) = L(y, y-\epsilon)$.
3. Convexity: The function is convex (though not strictly convex everywhere), ensuring a global minimum exists.
4. Non-smoothness: The function has a "kink" at $y = \hat{y}$ where it's not differentiable. This is the source of mathematical complexity.
5. Linear Growth: Errors grow linearly with magnitude, unlike the quadratic growth of squared loss.
| Residual (y - ŷ) | Absolute Loss | Squared Loss | Ratio (Squared/Absolute) |
|---|---|---|---|
| ±1 | 1 | 1 | 1× |
| ±2 | 2 | 4 | 2× |
| ±5 | 5 | 25 | 5× |
| ±10 | 10 | 100 | 10× |
| ±100 | 100 | 10,000 | 100× |
| ±1,000 | 1,000 | 1,000,000 | 1,000× |
Notice how the ratio grows with residual size. For an outlier with error 1,000, squared loss penalizes it 1,000× more than absolute loss. Fitting that single outlier perfectly would reduce squared loss by 1,000,000 but reduce absolute loss by only 1,000. Squared loss will sacrifice many small improvements to reduce one large error; absolute loss treats all improvements more equally.
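To make this concrete, here is a minimal sketch (plain NumPy, with illustrative residual values) that reproduces the table and shows how much more heavily the single largest error weighs in the squared loss:

```python
import numpy as np

residuals = np.array([1.0, 2.0, 5.0, 10.0, 100.0, 1000.0])

absolute = np.abs(residuals)   # grows linearly
squared = residuals ** 2       # grows quadratically

for r, a, s in zip(residuals, absolute, squared):
    print(f"residual {r:7.0f}: |r| = {a:10.0f}, r^2 = {s:12.0f}, ratio = {s / a:8.0f}x")

# Share of the total loss contributed by the single largest error
print("outlier share of squared loss: ", squared[-1] / squared.sum())
print("outlier share of absolute loss:", absolute[-1] / absolute.sum())
```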
Visual Comparison:
Imagine the loss surface as a function of prediction $\hat{y}$ for fixed $y$:
Squared loss: A smooth parabola with minimum at $\hat{y} = y$. The curvature (second derivative) is constant everywhere.
Absolute loss: A V-shaped function with its vertex at $\hat{y} = y$. Constant slope of -1 for $\hat{y} < y$ and +1 for $\hat{y} > y$. The "kink" at the vertex is where the derivative doesn't exist.
This V-shape has two important implications: the gradient has the same magnitude (1) no matter how far the prediction is from the target, so no single observation can dominate an update, and the derivative does not exist at the vertex, which requires the subgradient machinery discussed below.
For gradient boosting to work, we need the gradient of the loss with respect to predictions. Let's compute it carefully.
Case 1: $\hat{y} < y$ (underprediction)
$$L = y - \hat{y}$$ $$\frac{\partial L}{\partial \hat{y}} = -1$$
Case 2: $\hat{y} > y$ (overprediction)
$$L = \hat{y} - y$$ $$\frac{\partial L}{\partial \hat{y}} = +1$$
Case 3: $\hat{y} = y$ (exact match)
The derivative doesn't exist! The function has a corner.
Combining Cases:
We can write this compactly using the sign function:
$$\frac{\partial L}{\partial \hat{y}} = \text{sign}(\hat{y} - y) = \begin{cases} -1 & \text{if } \hat{y} < y \\ 0 & \text{if } \hat{y} = y \\ +1 & \text{if } \hat{y} > y \end{cases}$$
The choice of 0 at $\hat{y} = y$ is the standard convention, though any value in $[-1, 1]$ is a valid subgradient.
A subgradient generalizes the derivative to non-smooth convex functions. At a kink, multiple "slopes" are valid—any value between the left and right derivatives. For |x| at x=0, any value in [-1, 1] is a subgradient. Optimization algorithms work with any valid subgradient, though the choice can affect convergence speed.
The Pseudo-Residual for Absolute Loss:
In gradient boosting, we fit to the negative gradient:
$$r_i = -\frac{\partial L}{\partial F(x_i)} = -\text{sign}(F(x_i) - y_i) = \text{sign}(y_i - F(x_i))$$
Unlike squared loss, where pseudo-residuals are the continuous residuals themselves, absolute loss pseudo-residuals are discrete: they're always -1, 0, or +1!
Interpretation:
The magnitude of the error doesn't matter—only its direction. A prediction that's off by 1 receives the same pseudo-residual as one that's off by 1,000. This is the source of both the robustness (outliers don't dominate) and the challenge (we lose information about error magnitude).
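A minimal sketch of this, with illustrative values: the squared-loss pseudo-residuals preserve magnitude, while the absolute-loss pseudo-residuals keep only the direction.

```python
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
F = np.array([9.0, 11.0, 10.0, -990.0])   # current predictions; the last one is wildly wrong

# Squared-loss pseudo-residuals are (proportional to) the ordinary residuals
squared_pr = y_true - F                   # [ 1., -1., 0., 1000.]

# Absolute-loss pseudo-residuals: only the sign of the residual survives
absolute_pr = np.sign(y_true - F)         # [ 1., -1., 0., 1.]

print(squared_pr)
print(absolute_pr)
```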
When building trees for gradient boosting, we need to determine the optimal prediction for each leaf. For absolute loss, this requires solving:
$$w_j^* = \arg\min_w \sum_{i \in I_j} |y_i - (F_{m-1}(x_i) + w)|$$
Let $r_i = y_i - F_{m-1}(x_i)$ be the residual. We're finding:
$$w_j^* = \arg\min_w \sum_{i \in I_j} |r_i - w|$$
Theorem: The value $w^*$ that minimizes the sum of absolute deviations is the median of the values $\{r_i : i \in I_j\}$.
Proof Sketch:
Consider moving $w$ from below the median to above it. Each residual $r_i < w$ contributes $w - r_i$ to the loss, and each $r_i > w$ contributes $r_i - w$.
If more points are above $w$ than below, moving up decreases total loss. If more are below, moving down decreases loss. The optimum is where these balance—the median!
This is why absolute loss is called "median regression." While squared loss gives us the conditional mean E[Y|X], absolute loss gives us the conditional median Med[Y|X]. The median is far more robust to outliers than the mean—exactly what we want.
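A quick numerical confirmation of the theorem, a minimal sketch that scans candidate leaf values $w$ over a handful of illustrative residuals:

```python
import numpy as np

residuals = np.array([-5.0, -2.0, 1.0, 3.0, 100.0])

# Evaluate the sum of absolute deviations over a grid of candidate leaf values
candidates = np.linspace(-10, 110, 2401)
sad = np.array([np.abs(residuals - w).sum() for w in candidates])

best_w = candidates[np.argmin(sad)]
print(f"argmin of sum |r - w|: {best_w:.2f}")                 # 1.00
print(f"median of residuals:   {np.median(residuals):.2f}")   # 1.00 -- the same point
print(f"mean of residuals:     {np.mean(residuals):.2f}")     # 19.40 -- pulled by the outlier
```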
Computational Implications:
Computing the median is O(n) using a selection algorithm (or O(n log n) with sorting), versus a single O(n) pass for the mean. Both are linear, but the median's larger constant factor (or the extra log factor when sorting) makes leaf value computation somewhat more expensive.
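As a sketch of the selection approach: NumPy's `np.partition` performs an introselect-style partial sort, so the k-th smallest element can be found in linear average time without fully sorting a leaf's residuals.

```python
import numpy as np

def median_via_selection(values):
    """Median via partial selection (O(n) on average) rather than a full sort."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return np.partition(values, mid)[mid]
    part = np.partition(values, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

rng = np.random.default_rng(0)
leaf_residuals = rng.normal(size=10001)
print(median_via_selection(leaf_residuals), np.median(leaf_residuals))  # identical values
```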
Weighted Median:
In practice, samples may have weights (from subsampling, cross-validation, etc.). The optimal leaf value becomes the weighted median: the value where cumulative weight on each side is equal.
Example:
Suppose a leaf contains residuals: $\{-5, -2, 1, 3, 100\}$
The mean of these residuals is 19.4, pulled far from most of the data by the outlier (100); the median is 1, essentially unaffected. This illustrates why absolute loss is robust.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class AbsoluteLossGradientBoosting:
    """
    Gradient Boosting for regression using absolute loss (LAD).

    Key differences from squared loss:
    1. Pseudo-residuals are signs (-1, 0, +1), not residuals
    2. Leaf values are medians, not means
    3. Model targets conditional median, not conditional mean
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def _absolute_loss(self, y_true, y_pred):
        """Compute absolute loss: L = |y - yhat|"""
        return np.mean(np.abs(y_true - y_pred))

    def _negative_gradient(self, y_true, y_pred):
        """
        Compute negative gradient of absolute loss.

        For absolute loss: -dL/d(yhat) = sign(y - yhat)
        This is just the sign of the residual!
        """
        residuals = y_true - y_pred
        return np.sign(residuals)

    def _compute_leaf_values(self, tree, X, residuals):
        """
        Compute optimal leaf values as median of residuals.

        This is different from squared loss where we use mean.
        """
        # Get leaf indices for each sample
        leaf_indices = tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        # Compute median residual for each leaf
        leaf_values = {}
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            leaf_residuals = residuals[mask]
            leaf_values[leaf] = np.median(leaf_residuals)

        return leaf_values, leaf_indices

    def fit(self, X, y):
        """Fit gradient boosting ensemble with absolute loss."""
        n_samples = len(y)

        # Initialize with the median of targets (not mean!)
        self.initial_prediction = np.median(y)

        # Current ensemble predictions
        F = np.full(n_samples, self.initial_prediction)

        print("Gradient Boosting Training with Absolute Loss")
        print("=" * 60)
        print(f"Initial prediction (median): {self.initial_prediction:.4f}")
        print(f"Initial loss: {self._absolute_loss(y, F):.4f}")
        print()

        for m in range(self.n_estimators):
            # Step 1: Compute actual residuals (for leaf values)
            residuals = y - F

            # Step 2: Compute pseudo-residuals (signs) for tree fitting
            pseudo_residuals = self._negative_gradient(y, F)

            # Step 3: Fit tree to pseudo-residuals (signs)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)

            # Step 4: Compute optimal leaf values (median of actual residuals)
            leaf_values, leaf_indices = self._compute_leaf_values(
                tree, X, residuals
            )

            # Step 5: Make predictions using median leaf values
            predictions = np.array([leaf_values[leaf] for leaf in leaf_indices])

            # Step 6: Update ensemble
            F += self.learning_rate * predictions

            # Store tree and leaf values
            self.trees.append((tree, leaf_values))

            # Track progress
            if (m + 1) % 20 == 0 or m == 0:
                loss = self._absolute_loss(y, F)
                print(f"Iteration {m+1:3d}: MAE = {loss:.6f}")

        print()
        print(f"Final MAE: {self._absolute_loss(y, F):.6f}")
        return self

    def predict(self, X):
        """Generate predictions."""
        predictions = np.full(len(X), self.initial_prediction)

        for tree, leaf_values in self.trees:
            leaf_indices = tree.apply(X)
            tree_predictions = np.array([
                leaf_values.get(leaf, 0) for leaf in leaf_indices
            ])
            predictions += self.learning_rate * tree_predictions

        return predictions


# Demonstration with outliers
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingRegressor

    # Generate data
    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)

    # Add outliers
    outlier_indices = np.random.choice(len(y), 50, replace=False)
    y[outlier_indices] += np.random.choice([-1, 1], 50) * 200

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    print("\nComparison: Squared Loss vs Absolute Loss with Outliers")
    print("=" * 60)

    # Train with squared loss (sklearn)
    sq_model = GradientBoostingRegressor(loss='squared_error', n_estimators=100)
    sq_model.fit(X_train, y_train)
    sq_mae = np.mean(np.abs(y_test - sq_model.predict(X_test)))

    # Train with absolute loss (sklearn)
    abs_model = GradientBoostingRegressor(loss='absolute_error', n_estimators=100)
    abs_model.fit(X_train, y_train)
    abs_mae = np.mean(np.abs(y_test - abs_model.predict(X_test)))

    print(f"\nSquared Loss Model - Test MAE: {sq_mae:.4f}")
    print(f"Absolute Loss Model - Test MAE: {abs_mae:.4f}")
    print(f"Improvement: {(sq_mae - abs_mae) / sq_mae * 100:.1f}%")
```

Just as squared loss estimates the conditional mean, absolute loss estimates the conditional median.
Theorem: Among all functions of $X$, the conditional median minimizes expected absolute error:
$$\text{Med}[Y|X] = \arg\min_{f(X)} \mathbb{E}[|Y - f(X)|]$$
Why Median Instead of Mean?
The median is the value that minimizes the sum of absolute deviations, and it has several properties that make it an attractive target:
Robustness: The median is a robust estimator. Dragging a single observation toward infinity moves the mean without bound, but shifts the median at most to a neighboring data point.
Breakdown Point: The median has a breakdown point of 50%—you can corrupt up to half the data before the estimate breaks down. The mean has a breakdown point of 0%—a single outlier can move it arbitrarily.
Skewed Distributions: For skewed distributions, the median better represents the "typical" value than the mean.
Predicting the median is often more useful than the mean in practice. If you predict house prices, the median prediction gives a price where half the actual prices are above and half below. If predicting delivery times, the median tells you the time by which 50% of deliveries arrive. This is often what users actually want.
Connection to Quantile Regression:
Absolute loss is a special case of quantile loss, used to predict the $\tau$-th quantile of the distribution:
$$L_\tau(y, \hat{y}) = \begin{cases} \tau(y - \hat{y}) & \text{if } y > \hat{y} \\ (1-\tau)(\hat{y} - y) & \text{if } y \leq \hat{y} \end{cases}$$
For $\tau = 0.5$ (the median), this simplifies to:
$$L_{0.5}(y, \hat{y}) = 0.5|y - \hat{y}|$$
which is just half the absolute loss. Different $\tau$ values let you predict different quantiles: $\tau = 0.1$ targets the 10th percentile, $\tau = 0.9$ the 90th.
This extends absolute loss to quantile regression, useful for prediction intervals and risk assessment.
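A minimal sketch of the quantile (pinball) loss, verifying that $\tau = 0.5$ gives half the absolute loss (the arrays are illustrative):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Pinball loss: tau*(y - yhat) when under-predicting, (1 - tau)*(yhat - y) otherwise."""
    diff = y_true - y_pred
    return np.mean(np.where(diff > 0, tau * diff, (tau - 1) * diff))

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 5.0, 5.0])

print(quantile_loss(y_true, y_pred, tau=0.5))    # 0.5
print(0.5 * np.mean(np.abs(y_true - y_pred)))    # 0.5 -- half the MAE, as expected
print(quantile_loss(y_true, y_pred, tau=0.9))    # larger: under-prediction is penalized more
```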
| Scenario | Use Mean (Squared Loss) | Use Median (Absolute Loss) |
|---|---|---|
| Outliers present | ❌ Sensitive to outliers | ✅ Robust to outliers |
| Symmetric distribution | ✅ Mean = Median | ✅ Mean = Median |
| Skewed distribution | Mean pulled by tail | ✅ Represents typical value |
| Quadratic cost of errors | ✅ Appropriate | ❌ Use squared loss |
| Linear cost of errors | ❌ Use absolute loss | ✅ Appropriate |
| Interpretability | Mean is intuitive | ✅ 50% above, 50% below |
Robustness can be formalized through influence functions, which measure how much an estimator changes when we add an observation at a particular value.
Influence Function for Mean (Squared Loss):
$$IF(y; \bar{y}) = y - \bar{y}$$
The influence grows unboundedly as $|y|$ increases. An observation at $y = 10^9$ has an influence proportional to $10^9$.
Influence Function for Median (Absolute Loss):
$$IF(y; \text{med}) = \text{sign}(y - \text{med})$$
The influence is bounded—it's always either -1, 0, or +1, regardless of how extreme $y$ is. An observation at $y = 10^9$ has the same influence as one at $y = 2 \cdot \text{med}$.
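A small empirical illustration of bounded versus unbounded influence, using synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
clean = rng.normal(loc=10.0, scale=1.0, size=999)

for outlier in [12.0, 100.0, 1e9]:
    data = np.append(clean, outlier)
    print(f"outlier = {outlier:>14.1f}: mean = {data.mean():16.4f}, median = {np.median(data):.4f}")

# The mean drifts without bound as the outlier grows; the median barely moves.
```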
Practical Implications:
When training with absolute loss, each observation contributes a pseudo-residual of at most ±1, so even the most extreme outlier cannot dominate a boosting iteration, and a handful of corrupted labels shifts the fitted model only slightly.
Robustness comes at a cost. By ignoring error magnitude, absolute loss may be inefficient when there are no outliers and all points deserve attention proportional to their error. If your data is clean and Gaussian, squared loss remains optimal. Absolute loss shines when you suspect contamination.
Empirical Comparison:
Consider training on data with 5% outliers (residuals 10× larger than typical):
| Metric | Squared Loss Model | Absolute Loss Model |
|---|---|---|
| MSE on clean test | Higher (pulled by outliers) | Lower |
| MAE on clean test | Higher | Lower |
| MSE on outliers | Lower (optimized for them) | Higher |
| Overall MAE | Higher | Lower |
Squared loss sacrifices accuracy on the majority to reduce error on outliers. Absolute loss does the opposite—it maintains accuracy on the majority at the cost of larger errors on outliers.
Which is better depends on your application: if large errors on genuinely extreme cases are costly, squared loss's focus on them may be exactly what you want; if the outliers are noise and typical-case accuracy matters most, absolute loss is the better choice.
Absolute loss introduces computational challenges not present in squared loss:
1. Non-differentiability at Zero:
When $\hat{y} = y$, the gradient is undefined. Implementations handle this by returning 0 for exact-zero residuals (the standard subgradient convention).
2. Leaf Value Computation:
Finding the median is O(n log n) with sorting or O(n) with a selection algorithm.
Most implementations use quickselect, falling back to sorting for small leaves.
3. No Closed-Form Hessian:
The second derivative is zero everywhere except at zero, where it's undefined (a Dirac delta function). XGBoost-style second-order methods cannot directly use absolute loss.
Workaround: Use a constant (e.g., 1) as the Hessian approximation, which makes the algorithm behave like first-order gradient descent.
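As a sketch of that workaround, here is what a custom L1 objective for a second-order booster might look like. The `(y_true, y_pred) -> (grad, hess)` convention below matches the scikit-learn wrappers of recent XGBoost and LightGBM releases, but treat it as an assumption and check your library's documented custom-objective signature:

```python
import numpy as np

def l1_objective(y_true, y_pred):
    """Custom absolute-loss objective: sign gradient, constant Hessian approximation."""
    grad = np.sign(y_pred - y_true)   # dL/d(y_pred) = sign(y_pred - y_true)
    hess = np.ones_like(y_pred)       # true second derivative is 0 almost everywhere; use 1
    return grad, hess

# Hypothetical usage with an sklearn-style wrapper that accepts a callable objective:
# model = xgb.XGBRegressor(objective=l1_objective, n_estimators=100)
```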
| Operation | Squared Loss | Absolute Loss |
|---|---|---|
| Gradient | O(1) per sample | O(1) per sample |
| Hessian | O(1) per sample (constant) | Undefined (use constant approximation) |
| Leaf value | O(n) mean | O(n) median (with selection) |
| Numerical stability | Excellent | Good (handle zero residuals) |
| Second-order methods | ✅ Full support | ⚠️ Approximation needed |
Scikit-learn's GradientBoostingRegressor supports loss='absolute_error'. XGBoost supports 'reg:absoluteerror' but it's less optimized than squared loss. LightGBM uses 'mae' (mean absolute error). All major libraries handle the computational challenges internally.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
import xgboost as xgb


def compare_implementations(X_train, y_train, X_test, y_test):
    """
    Compare absolute loss implementations across libraries.

    Note the different parameter names and behaviors.
    """
    results = {}

    # Scikit-learn
    sklearn_model = GradientBoostingRegressor(
        loss='absolute_error',  # or 'lad' in older versions
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    sklearn_model.fit(X_train, y_train)
    results['sklearn'] = np.mean(np.abs(y_test - sklearn_model.predict(X_test)))

    # LightGBM
    lgb_model = lgb.LGBMRegressor(
        objective='mae',  # Mean Absolute Error
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    lgb_model.fit(X_train, y_train)
    results['lightgbm'] = np.mean(np.abs(y_test - lgb_model.predict(X_test)))

    # XGBoost
    xgb_model = xgb.XGBRegressor(
        objective='reg:absoluteerror',
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)
    results['xgboost'] = np.mean(np.abs(y_test - xgb_model.predict(X_test)))

    print("Absolute Loss - Library Comparison")
    print("=" * 40)
    for lib, mae in results.items():
        print(f"{lib:12s}: MAE = {mae:.4f}")

    return results


# Helper function for median computation
def weighted_median(values, weights=None):
    """
    Compute weighted median efficiently.

    This is what gradient boosting does internally for leaf values.
    """
    if weights is None:
        return np.median(values)

    # Sort by values
    sorted_indices = np.argsort(values)
    sorted_values = values[sorted_indices]
    sorted_weights = weights[sorted_indices]

    # Find cumulative weight
    cumsum = np.cumsum(sorted_weights)
    cutoff = sorted_weights.sum() / 2.0

    # Return value where cumulative weight crosses 50%
    return sorted_values[np.searchsorted(cumsum, cutoff)]
```

The constant gradient magnitude (always ±1) leads to different convergence behavior compared to squared loss.
Early Iterations:
With squared loss, residuals start large and provide strong gradients. As predictions improve, gradients shrink, naturally reducing the step size near convergence.
With absolute loss, gradients are always ±1 regardless of residual size. Early iterations don't automatically take larger steps.
Near Convergence:
Squared loss: Gradients shrink to zero as predictions approach targets. Convergence is smooth.
Absolute loss: Gradients remain ±1 even for small residuals. The algorithm may "oscillate" around the optimum, taking steps that overshoot.
Implications:
Learning rate matters more: With constant gradients, the learning rate solely determines step size. Too large → oscillation. Too small → slow convergence.
Line search helps: Adaptive step size selection (line search) is more valuable for absolute loss than squared loss.
Early stopping: Since the loss doesn't plateau as smoothly, early stopping based on validation loss is crucial.
When using absolute loss, start with a smaller learning rate than you would for squared loss (e.g., 0.05 instead of 0.1). The constant gradient magnitude means the effective step size doesn't decrease near convergence, so a smaller starting rate helps avoid oscillation.
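A sketch of how you might compare learning rates and track per-iteration validation MAE with scikit-learn (the dataset and parameter values are illustrative; `n_iter_no_change` enables early stopping on an internal validation split):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for lr in (0.1, 0.05):
    model = GradientBoostingRegressor(
        loss="absolute_error", learning_rate=lr, n_estimators=300,
        max_depth=3, n_iter_no_change=10, random_state=0,
    )
    model.fit(X_tr, y_tr)
    # Validation MAE after each boosting stage
    val_mae = [np.mean(np.abs(y_val - pred)) for pred in model.staged_predict(X_val)]
    print(f"lr={lr}: {model.n_estimators_} trees, best validation MAE = {min(val_mae):.3f}")
```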
Loss Curve Characteristics:
Squared loss training curves often show smooth exponential decay. Absolute loss curves tend to be noisier and more stair-stepped: progress arrives in roughly constant-size steps, and the curve may wobble near convergence instead of flattening smoothly.
Best Practices for Absolute Loss:
1. Start with a smaller learning rate than you would use for squared loss.
2. Monitor validation MAE and use early stopping.
3. Prefer implementations that compute median (or weighted median) leaf values.
4. Confirm your data actually contains outliers or heavy tails; on clean, Gaussian-like data, squared loss is more statistically efficient.
We've thoroughly explored the absolute loss function as a robust alternative to squared loss. Let's consolidate the key insights:
1. Absolute loss grows linearly with error magnitude, so the influence of any single observation is bounded.
2. Its negative gradient is the sign of the residual: pseudo-residuals are always -1, 0, or +1, with a subgradient convention handling the kink at zero.
3. Optimal leaf values are medians (weighted medians with sample weights), and the fitted model targets the conditional median rather than the conditional mean.
4. Robustness costs efficiency on clean data and brings computational challenges: non-differentiability, constant-magnitude gradients, and no usable Hessian for second-order methods.
What's Next:
Squared loss is ideal for clean data; absolute loss handles contamination. But what if you want the best of both? The next page introduces Huber Loss—a smooth compromise that behaves like squared loss for small errors (efficiency) and like absolute loss for large errors (robustness). This hybrid approach often achieves better results than either extreme alone.
You now understand absolute loss as a fundamentally different approach to regression—targeting robustness over efficiency, medians over means, and bounded influence over unbounded sensitivity. This knowledge prepares you for the Huber loss, which elegantly combines the strengths of both approaches.