Squared loss is elegant, well-behaved, and computationally convenient. But it has an Achilles' heel: extreme sensitivity to outliers. A single data point with a prediction error of 100 contributes as much to the loss as 10,000 points with errors of 1. In real-world datasets—riddled with measurement errors, data entry mistakes, and genuine anomalies—this sensitivity can be catastrophic.
Enter the absolute loss (also called L1 loss, LAD loss for Least Absolute Deviation, or Mean Absolute Error when averaged). Instead of squaring the error, we simply take its absolute value:
$$L(y, \hat{y}) = |y - \hat{y}|$$
This seemingly minor change has profound consequences. The absolute loss grows linearly with error magnitude—not quadratically. An outlier with error 100 contributes 100 to the loss, not 10,000. This robustness comes at a price: the loss function is not smooth, introducing mathematical and computational challenges that require careful handling.
By the end of this page, you will understand absolute loss in depth—its definition, gradient (and subgradient at zero), connection to the conditional median, robustness properties, and implementation considerations in gradient boosting. You'll learn when absolute loss is preferable to squared loss and how to handle its non-differentiability.
The absolute loss function measures the absolute difference between predicted and actual values:
$$L(y, \hat{y}) = |y - \hat{y}|$$
For a dataset of $n$ observations, the total loss is:
$$\mathcal{L}(F) = \sum_{i=1}^{n} |y_i - F(x_i)|$$
Key Properties:
1. Non-negativity: $L(y, \hat{y}) \geq 0$, with equality only when $\hat{y} = y$.
2. Symmetry: Like squared loss, the penalty is symmetric: $L(y, y+\epsilon) = L(y, y-\epsilon)$.
3. Convexity: The function is convex (though not strictly convex everywhere), ensuring a global minimum exists.
4. Non-smoothness: The function has a "kink" at $y = \hat{y}$ where it's not differentiable. This is the source of mathematical complexity.
5. Linear Growth: Errors grow linearly with magnitude, unlike the quadratic growth of squared loss.
| Residual (y - ŷ) | Absolute Loss | Squared Loss | Ratio (Squared/Absolute) |
|---|---|---|---|
| ±1 | 1 | 1 | 1× |
| ±2 | 2 | 4 | 2× |
| ±5 | 5 | 25 | 5× |
| ±10 | 10 | 100 | 10× |
| ±100 | 100 | 10,000 | 100× |
| ±1,000 | 1,000 | 1,000,000 | 1,000× |
Notice how the ratio grows with residual size. For an outlier with error 1,000, squared loss penalizes it 1,000× more than absolute loss. Fitting that single outlier perfectly would reduce squared loss by 1,000,000 but reduce absolute loss by only 1,000. Squared loss will sacrifice many small improvements to reduce one large error; absolute loss treats all improvements more equally.
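To make this concrete, here is a minimal sketch (plain NumPy, with illustrative residual values) that reproduces the table and shows how much more heavily the single largest error weighs in the squared loss:

```python
import numpy as np

residuals = np.array([1.0, 2.0, 5.0, 10.0, 100.0, 1000.0])

absolute = np.abs(residuals)   # grows linearly
squared = residuals ** 2       # grows quadratically

for r, a, s in zip(residuals, absolute, squared):
    print(f"residual {r:7.0f}: |r| = {a:10.0f}, r^2 = {s:12.0f}, ratio = {s / a:8.0f}x")

# Share of the total loss contributed by the single largest error
print("outlier share of squared loss: ", squared[-1] / squared.sum())
print("outlier share of absolute loss:", absolute[-1] / absolute.sum())
```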
Visual Comparison:
Imagine the loss surface as a function of prediction $\hat{y}$ for fixed $y$:
Squared loss: A smooth parabola with minimum at $\hat{y} = y$. The curvature (second derivative) is constant everywhere.
Absolute loss: A V-shaped function with its vertex at $\hat{y} = y$. Constant slope of -1 for $\hat{y} < y$ and +1 for $\hat{y} > y$. The "kink" at the vertex is where the derivative doesn't exist.
This V-shape has two important implications: the gradient has the same magnitude (1) no matter how far the prediction is from the target, so no single observation can dominate an update, and the derivative does not exist at the vertex, which requires the subgradient machinery discussed below.
For gradient boosting to work, we need the gradient of the loss with respect to predictions. Let's compute it carefully.
Case 1: $\hat{y} < y$ (underprediction)
$$L = y - \hat{y}$$ $$\frac{\partial L}{\partial \hat{y}} = -1$$
Case 2: $\hat{y} > y$ (overprediction)
$$L = \hat{y} - y$$ $$\frac{\partial L}{\partial \hat{y}} = +1$$
Case 3: $\hat{y} = y$ (exact match)
The derivative doesn't exist! The function has a corner.
Combining Cases:
We can write this compactly using the sign function:
$$\frac{\partial L}{\partial \hat{y}} = \text{sign}(\hat{y} - y) = \begin{cases} -1 & \text{if } \hat{y} < y \\ 0 & \text{if } \hat{y} = y \\ +1 & \text{if } \hat{y} > y \end{cases}$$
The choice of 0 at $\hat{y} = y$ is the standard convention, though any value in $[-1, 1]$ is a valid subgradient.
A subgradient generalizes the derivative to non-smooth convex functions. At a kink, multiple "slopes" are valid—any value between the left and right derivatives. For |x| at x=0, any value in [-1, 1] is a subgradient. Optimization algorithms work with any valid subgradient, though the choice can affect convergence speed.
The Pseudo-Residual for Absolute Loss:
In gradient boosting, we fit to the negative gradient:
$$r_i = -\frac{\partial L}{\partial F(x_i)} = -\text{sign}(F(x_i) - y_i) = \text{sign}(y_i - F(x_i))$$
Unlike squared loss, where pseudo-residuals are the continuous residuals themselves, absolute loss pseudo-residuals are discrete: they're always -1, 0, or +1!
Interpretation:
The magnitude of the error doesn't matter—only its direction. A prediction that's off by 1 receives the same pseudo-residual as one that's off by 1,000. This is the source of both the robustness (outliers don't dominate) and the challenge (we lose information about error magnitude).
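A minimal sketch of this, with illustrative values: the squared-loss pseudo-residuals preserve magnitude, while the absolute-loss pseudo-residuals keep only the direction.

```python
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
F = np.array([9.0, 11.0, 10.0, -990.0])   # current predictions; the last one is wildly wrong

# Squared-loss pseudo-residuals are (proportional to) the ordinary residuals
squared_pr = y_true - F                   # [ 1., -1., 0., 1000.]

# Absolute-loss pseudo-residuals: only the sign of the residual survives
absolute_pr = np.sign(y_true - F)         # [ 1., -1., 0., 1.]

print(squared_pr)
print(absolute_pr)
```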
When building trees for gradient boosting, we need to determine the optimal prediction for each leaf. For absolute loss, this requires solving:
$$w_j^* = \arg\min_w \sum_{i \in I_j} |y_i - (F_{m-1}(x_i) + w)|$$
Let $r_i = y_i - F_{m-1}(x_i)$ be the residual. We're finding:
$$w_j^* = \arg\min_w \sum_{i \in I_j} |r_i - w|$$
Theorem: The value $w^*$ that minimizes the sum of absolute deviations is the median of the values $\{r_i : i \in I_j\}$.
Proof Sketch:
Consider moving $w$ from below the median to above it. Each residual $r_i < w$ contributes $w - r_i$ to the loss, and each $r_i > w$ contributes $r_i - w$.
If more points are above $w$ than below, moving up decreases total loss. If more are below, moving down decreases loss. The optimum is where these balance—the median!
This is why absolute loss is called "median regression." While squared loss gives us the conditional mean E[Y|X], absolute loss gives us the conditional median Med[Y|X]. The median is far more robust to outliers than the mean—exactly what we want.
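A quick numerical confirmation of the theorem, a minimal sketch that scans candidate leaf values $w$ over a handful of illustrative residuals:

```python
import numpy as np

residuals = np.array([-5.0, -2.0, 1.0, 3.0, 100.0])

# Evaluate the sum of absolute deviations over a grid of candidate leaf values
candidates = np.linspace(-10, 110, 2401)
sad = np.array([np.abs(residuals - w).sum() for w in candidates])

best_w = candidates[np.argmin(sad)]
print(f"argmin of sum |r - w|: {best_w:.2f}")                 # 1.00
print(f"median of residuals:   {np.median(residuals):.2f}")   # 1.00 -- the same point
print(f"mean of residuals:     {np.mean(residuals):.2f}")     # 19.40 -- pulled by the outlier
```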
Computational Implications:
Computing the median is O(n) using a selection algorithm (or O(n log n) with sorting), versus a single O(n) pass for the mean. Both are linear, but the median's larger constant factor (or the extra log factor when sorting) makes leaf value computation somewhat more expensive.
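As a sketch of the selection approach: NumPy's `np.partition` performs an introselect-style partial sort, so the k-th smallest element can be found in linear average time without fully sorting a leaf's residuals.

```python
import numpy as np

def median_via_selection(values):
    """Median via partial selection (O(n) on average) rather than a full sort."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return np.partition(values, mid)[mid]
    part = np.partition(values, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

rng = np.random.default_rng(0)
leaf_residuals = rng.normal(size=10001)
print(median_via_selection(leaf_residuals), np.median(leaf_residuals))  # identical values
```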
Weighted Median:
In practice, samples may have weights (from subsampling, cross-validation, etc.). The optimal leaf value becomes the weighted median: the value where cumulative weight on each side is equal.
Example:
Suppose a leaf contains residuals: $\{-5, -2, 1, 3, 100\}$
The mean of these residuals is 19.4, pulled far from most of the data by the outlier (100); the median is 1, essentially unaffected. This illustrates why absolute loss is robust.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class AbsoluteLossGradientBoosting:
    """
    Gradient Boosting for regression using absolute loss (LAD).

    Key differences from squared loss:
    1. Pseudo-residuals are signs (-1, 0, +1), not residuals
    2. Leaf values are medians, not means
    3. Model targets conditional median, not conditional mean
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def _absolute_loss(self, y_true, y_pred):
        """Compute absolute loss: L = |y - yhat|"""
        return np.mean(np.abs(y_true - y_pred))

    def _negative_gradient(self, y_true, y_pred):
        """
        Compute negative gradient of absolute loss.

        For absolute loss: -dL/d(yhat) = sign(y - yhat)
        This is just the sign of the residual!
        """
        residuals = y_true - y_pred
        return np.sign(residuals)

    def _compute_leaf_values(self, tree, X, residuals):
        """
        Compute optimal leaf values as median of residuals.

        This is different from squared loss where we use mean.
        """
        # Get leaf indices for each sample
        leaf_indices = tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        # Compute median residual for each leaf
        leaf_values = {}
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            leaf_residuals = residuals[mask]
            leaf_values[leaf] = np.median(leaf_residuals)

        return leaf_values, leaf_indices

    def fit(self, X, y):
        """Fit gradient boosting ensemble with absolute loss."""
        n_samples = len(y)

        # Initialize with the median of targets (not mean!)
        self.initial_prediction = np.median(y)

        # Current ensemble predictions
        F = np.full(n_samples, self.initial_prediction)

        print("Gradient Boosting Training with Absolute Loss")
        print("=" * 60)
        print(f"Initial prediction (median): {self.initial_prediction:.4f}")
        print(f"Initial loss: {self._absolute_loss(y, F):.4f}")
        print()

        for m in range(self.n_estimators):
            # Step 1: Compute actual residuals (for leaf values)
            residuals = y - F

            # Step 2: Compute pseudo-residuals (signs) for tree fitting
            pseudo_residuals = self._negative_gradient(y, F)

            # Step 3: Fit tree to pseudo-residuals (signs)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, pseudo_residuals)

            # Step 4: Compute optimal leaf values (median of actual residuals)
            leaf_values, leaf_indices = self._compute_leaf_values(
                tree, X, residuals
            )

            # Step 5: Make predictions using median leaf values
            predictions = np.array([leaf_values[leaf] for leaf in leaf_indices])

            # Step 6: Update ensemble
            F += self.learning_rate * predictions

            # Store tree and leaf values
            self.trees.append((tree, leaf_values))

            # Track progress
            if (m + 1) % 20 == 0 or m == 0:
                loss = self._absolute_loss(y, F)
                print(f"Iteration {m+1:3d}: MAE = {loss:.6f}")

        print()
        print(f"Final MAE: {self._absolute_loss(y, F):.6f}")
        return self

    def predict(self, X):
        """Generate predictions."""
        predictions = np.full(len(X), self.initial_prediction)

        for tree, leaf_values in self.trees:
            leaf_indices = tree.apply(X)
            tree_predictions = np.array([
                leaf_values.get(leaf, 0) for leaf in leaf_indices
            ])
            predictions += self.learning_rate * tree_predictions

        return predictions


# Demonstration with outliers
if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingRegressor

    # Generate data
    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)

    # Add outliers
    outlier_indices = np.random.choice(len(y), 50, replace=False)
    y[outlier_indices] += np.random.choice([-1, 1], 50) * 200

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    print("\nComparison: Squared Loss vs Absolute Loss with Outliers")
    print("=" * 60)

    # Train with squared loss (sklearn)
    sq_model = GradientBoostingRegressor(loss='squared_error', n_estimators=100)
    sq_model.fit(X_train, y_train)
    sq_mae = np.mean(np.abs(y_test - sq_model.predict(X_test)))

    # Train with absolute loss (sklearn)
    abs_model = GradientBoostingRegressor(loss='absolute_error', n_estimators=100)
    abs_model.fit(X_train, y_train)
    abs_mae = np.mean(np.abs(y_test - abs_model.predict(X_test)))

    print(f"\nSquared Loss Model - Test MAE: {sq_mae:.4f}")
    print(f"Absolute Loss Model - Test MAE: {abs_mae:.4f}")
    print(f"Improvement: {(sq_mae - abs_mae) / sq_mae * 100:.1f}%")
```

Just as squared loss estimates the conditional mean, absolute loss estimates the conditional median.
Theorem: Among all functions of $X$, the conditional median minimizes expected absolute error:
$$\text{Med}[Y|X] = \arg\min_{f(X)} \mathbb{E}[|Y - f(X)|]$$
Why Median Instead of Mean?
The median is the value that minimizes the sum of absolute deviations, and it has several properties that make it an attractive target:
Robustness: The median is a robust estimator. Dragging a single observation toward infinity moves the mean without bound, but shifts the median at most to a neighboring data point.
Breakdown Point: The median has a breakdown point of 50%—you can corrupt up to half the data before the estimate breaks down. The mean has a breakdown point of 0%—a single outlier can move it arbitrarily.
Skewed Distributions: For skewed distributions, the median better represents the "typical" value than the mean.
Predicting the median is often more useful than the mean in practice. If you predict house prices, the median prediction gives a price where half the actual prices are above and half below. If predicting delivery times, the median tells you the time by which 50% of deliveries arrive. This is often what users actually want.
Connection to Quantile Regression:
Absolute loss is a special case of quantile loss, used to predict the $\tau$-th quantile of the distribution:
$$L_\tau(y, \hat{y}) = \begin{cases} \tau(y - \hat{y}) & \text{if } y > \hat{y} \\ (1-\tau)(\hat{y} - y) & \text{if } y \leq \hat{y} \end{cases}$$
For $\tau = 0.5$ (the median), this simplifies to:
$$L_{0.5}(y, \hat{y}) = 0.5|y - \hat{y}|$$
which is just half the absolute loss. Different $\tau$ values let you predict different quantiles: $\tau = 0.1$ targets the 10th percentile, $\tau = 0.9$ the 90th.
This extends absolute loss to quantile regression, useful for prediction intervals and risk assessment.
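A minimal sketch of the quantile (pinball) loss, verifying that $\tau = 0.5$ gives half the absolute loss (the arrays are illustrative):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Pinball loss: tau*(y - yhat) when under-predicting, (1 - tau)*(yhat - y) otherwise."""
    diff = y_true - y_pred
    return np.mean(np.where(diff > 0, tau * diff, (tau - 1) * diff))

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 5.0, 5.0])

print(quantile_loss(y_true, y_pred, tau=0.5))    # 0.5
print(0.5 * np.mean(np.abs(y_true - y_pred)))    # 0.5 -- half the MAE, as expected
print(quantile_loss(y_true, y_pred, tau=0.9))    # larger: under-prediction is penalized more
```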
| Scenario | Use Mean (Squared Loss) | Use Median (Absolute Loss) |
|---|---|---|
| Outliers present | ❌ Sensitive to outliers | ✅ Robust to outliers |
| Symmetric distribution | ✅ Mean = Median | ✅ Mean = Median |
| Skewed distribution | Mean pulled by tail | ✅ Represents typical value |
| Quadratic cost of errors | ✅ Appropriate | ❌ Use squared loss |
| Linear cost of errors | ❌ Use absolute loss | ✅ Appropriate |
| Interpretability | Mean is intuitive | ✅ 50% above, 50% below |
Robustness can be formalized through influence functions, which measure how much an estimator changes when we add an observation at a particular value.
Influence Function for Mean (Squared Loss):
$$IF(y; \bar{y}) = y - \bar{y}$$
The influence grows unboundedly as $|y|$ increases. An observation at $y = 10^9$ has an influence proportional to $10^9$.
Influence Function for Median (Absolute Loss):
$$IF(y; \text{med}) = \text{sign}(y - \text{med})$$
The influence is bounded—it's always either -1, 0, or +1, regardless of how extreme $y$ is. An observation at $y = 10^9$ has the same influence as one at $y = 2 \cdot \text{med}$.
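A small empirical illustration of bounded versus unbounded influence, using synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
clean = rng.normal(loc=10.0, scale=1.0, size=999)

for outlier in [12.0, 100.0, 1e9]:
    data = np.append(clean, outlier)
    print(f"outlier = {outlier:>14.1f}: mean = {data.mean():16.4f}, median = {np.median(data):.4f}")

# The mean drifts without bound as the outlier grows; the median barely moves.
```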
Practical Implications:
When training with absolute loss, each observation contributes a pseudo-residual of at most ±1, so even the most extreme outlier cannot dominate a boosting iteration, and a handful of corrupted labels shifts the fitted model only slightly.
Robustness comes at a cost. By ignoring error magnitude, absolute loss may be inefficient when there are no outliers and all points deserve attention proportional to their error. If your data is clean and Gaussian, squared loss remains optimal. Absolute loss shines when you suspect contamination.
Empirical Comparison:
Consider training on data with 5% outliers (residuals 10× larger than typical):
| Metric | Squared Loss Model | Absolute Loss Model |
|---|---|---|
| MSE on clean test | Higher (pulled by outliers) | Lower |
| MAE on clean test | Higher | Lower |
| MSE on outliers | Lower (optimized for them) | Higher |
| Overall MAE | Higher | Lower |
Squared loss sacrifices accuracy on the majority to reduce error on outliers. Absolute loss does the opposite—it maintains accuracy on the majority at the cost of larger errors on outliers.
Which is better depends on your application: if large errors on genuinely extreme cases are costly, squared loss's focus on them may be exactly what you want; if the outliers are noise and typical-case accuracy matters most, absolute loss is the better choice.
Absolute loss introduces computational challenges not present in squared loss:
1. Non-differentiability at Zero:
When $\hat{y} = y$, the gradient is undefined. Implementations handle this by returning 0 for exact-zero residuals (the standard subgradient convention).
2. Leaf Value Computation:
Finding the median is O(n log n) with sorting or O(n) with a selection algorithm.
Most implementations use quickselect, falling back to sorting for small leaves.
3. No Closed-Form Hessian:
The second derivative is zero everywhere except at zero, where it's undefined (a Dirac delta function). XGBoost-style second-order methods cannot directly use absolute loss.
Workaround: Use a constant (e.g., 1) as the Hessian approximation, which makes the algorithm behave like first-order gradient descent.
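As a sketch of that workaround, here is what a custom L1 objective for a second-order booster might look like. The `(y_true, y_pred) -> (grad, hess)` convention below matches the scikit-learn wrappers of recent XGBoost and LightGBM releases, but treat it as an assumption and check your library's documented custom-objective signature:

```python
import numpy as np

def l1_objective(y_true, y_pred):
    """Custom absolute-loss objective: sign gradient, constant Hessian approximation."""
    grad = np.sign(y_pred - y_true)   # dL/d(y_pred) = sign(y_pred - y_true)
    hess = np.ones_like(y_pred)       # true second derivative is 0 almost everywhere; use 1
    return grad, hess

# Hypothetical usage with an sklearn-style wrapper that accepts a callable objective:
# model = xgb.XGBRegressor(objective=l1_objective, n_estimators=100)
```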
| Operation | Squared Loss | Absolute Loss |
|---|---|---|
| Gradient | O(1) per sample | O(1) per sample |
| Hessian | O(1) per sample (constant) | Undefined (use constant approximation) |
| Leaf value | O(n) mean | O(n) median (with selection) |
| Numerical stability | Excellent | Good (handle zero residuals) |
| Second-order methods | ✅ Full support | ⚠️ Approximation needed |
Scikit-learn's GradientBoostingRegressor supports loss='absolute_error'. XGBoost supports 'reg:absoluteerror' but it's less optimized than squared loss. LightGBM uses 'mae' (mean absolute error). All major libraries handle the computational challenges internally.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
import xgboost as xgb


def compare_implementations(X_train, y_train, X_test, y_test):
    """
    Compare absolute loss implementations across libraries.

    Note the different parameter names and behaviors.
    """
    results = {}

    # Scikit-learn
    sklearn_model = GradientBoostingRegressor(
        loss='absolute_error',  # or 'lad' in older versions
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    sklearn_model.fit(X_train, y_train)
    results['sklearn'] = np.mean(np.abs(y_test - sklearn_model.predict(X_test)))

    # LightGBM
    lgb_model = lgb.LGBMRegressor(
        objective='mae',  # Mean Absolute Error
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    lgb_model.fit(X_train, y_train)
    results['lightgbm'] = np.mean(np.abs(y_test - lgb_model.predict(X_test)))

    # XGBoost
    xgb_model = xgb.XGBRegressor(
        objective='reg:absoluteerror',
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)
    results['xgboost'] = np.mean(np.abs(y_test - xgb_model.predict(X_test)))

    print("Absolute Loss - Library Comparison")
    print("=" * 40)
    for lib, mae in results.items():
        print(f"{lib:12s}: MAE = {mae:.4f}")

    return results


# Helper function for median computation
def weighted_median(values, weights=None):
    """
    Compute weighted median efficiently.

    This is what gradient boosting does internally for leaf values.
    """
    if weights is None:
        return np.median(values)

    # Sort by values
    sorted_indices = np.argsort(values)
    sorted_values = values[sorted_indices]
    sorted_weights = weights[sorted_indices]

    # Find cumulative weight
    cumsum = np.cumsum(sorted_weights)
    cutoff = sorted_weights.sum() / 2.0

    # Return value where cumulative weight crosses 50%
    return sorted_values[np.searchsorted(cumsum, cutoff)]
```

The constant gradient magnitude (always ±1) leads to different convergence behavior compared to squared loss.
Early Iterations:
With squared loss, residuals start large and provide strong gradients. As predictions improve, gradients shrink, naturally reducing the step size near convergence.
With absolute loss, gradients are always ±1 regardless of residual size. Early iterations don't automatically take larger steps.
Near Convergence:
Squared loss: Gradients shrink to zero as predictions approach targets. Convergence is smooth.
Absolute loss: Gradients remain ±1 even for small residuals. The algorithm may "oscillate" around the optimum, taking steps that overshoot.
Implications:
Learning rate matters more: With constant gradients, the learning rate solely determines step size. Too large → oscillation. Too small → slow convergence.
Line search helps: Adaptive step size selection (line search) is more valuable for absolute loss than squared loss.
Early stopping: Since the loss doesn't plateau as smoothly, early stopping based on validation loss is crucial.
When using absolute loss, start with a smaller learning rate than you would for squared loss (e.g., 0.05 instead of 0.1). The constant gradient magnitude means the effective step size doesn't decrease near convergence, so a smaller starting rate helps avoid oscillation.
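A sketch of how you might compare learning rates and track per-iteration validation MAE with scikit-learn (the dataset and parameter values are illustrative; `n_iter_no_change` enables early stopping on an internal validation split):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

for lr in (0.1, 0.05):
    model = GradientBoostingRegressor(
        loss="absolute_error", learning_rate=lr, n_estimators=300,
        max_depth=3, n_iter_no_change=10, random_state=0,
    )
    model.fit(X_tr, y_tr)
    # Validation MAE after each boosting stage
    val_mae = [np.mean(np.abs(y_val - pred)) for pred in model.staged_predict(X_val)]
    print(f"lr={lr}: {model.n_estimators_} trees, best validation MAE = {min(val_mae):.3f}")
```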
Loss Curve Characteristics:
Squared loss training curves often show smooth exponential decay. Absolute loss curves tend to be noisier and more stair-stepped: progress arrives in roughly constant-size steps, and the curve may wobble near convergence instead of flattening smoothly.
Best Practices for Absolute Loss:
1. Start with a smaller learning rate than you would use for squared loss.
2. Monitor validation MAE and use early stopping.
3. Prefer implementations that compute median (or weighted median) leaf values.
4. Confirm your data actually contains outliers or heavy tails; on clean, Gaussian-like data, squared loss is more statistically efficient.
We've thoroughly explored the absolute loss function as a robust alternative to squared loss. Let's consolidate the key insights:
1. Absolute loss grows linearly with error magnitude, so the influence of any single observation is bounded.
2. Its negative gradient is the sign of the residual: pseudo-residuals are always -1, 0, or +1, with a subgradient convention handling the kink at zero.
3. Optimal leaf values are medians (weighted medians with sample weights), and the fitted model targets the conditional median rather than the conditional mean.
4. Robustness costs efficiency on clean data and brings computational challenges: non-differentiability, constant-magnitude gradients, and no usable Hessian for second-order methods.
What's Next:
Squared loss is ideal for clean data; absolute loss handles contamination. But what if you want the best of both? The next page introduces Huber Loss—a smooth compromise that behaves like squared loss for small errors (efficiency) and like absolute loss for large errors (robustness). This hybrid approach often achieves better results than either extreme alone.
You now understand absolute loss as a fundamentally different approach to regression—targeting robustness over efficiency, medians over means, and bounded influence over unbounded sensitivity. This knowledge prepares you for the Huber loss, which elegantly combines the strengths of both approaches.