On the previous page, we mentioned that gradient boosting fits each new base learner to pseudo-residuals—the negative gradients of the loss function at the current predictions. This seemingly simple idea is what elevates gradient boosting from a specialized technique into a universal framework applicable to regression, classification, ranking, survival analysis, and virtually any supervised learning task.
But what exactly are pseudo-residuals? Why do we call them 'pseudo' rather than simply 'residuals'? And how do we compute them for different loss functions?
This page answers these questions comprehensively, providing the mathematical foundation and practical intuition for working with pseudo-residuals across diverse machine learning tasks.
By the end of this page, you will understand: the formal definition of pseudo-residuals as negative gradients, why they generalize classical residuals, how to derive pseudo-residuals for common loss functions, the geometric interpretation of gradient direction in prediction space, and practical considerations for robust computation.
In traditional statistics, a residual is simply the difference between the observed value and the predicted value:
$$r_i = y_i - \hat{y}_i$$
For ordinary least squares regression, minimizing the sum of squared residuals yields the optimal fit. Residuals have an intuitive interpretation: they represent what the model 'missed' or failed to explain.
The classical residual makes sense for squared loss:
$$L(y, F) = \frac{1}{2}(y - F)^2$$
But what about other loss functions? Consider absolute loss for robust regression:
$$L(y, F) = |y - F|$$
Or logistic loss for classification:
$$L(y, F) = \log(1 + e^{-yF})$$
For these losses, the classical residual $y - F$ doesn't directly correspond to the gradient of the loss. Fitting to classical residuals would optimize squared loss regardless of the actual objective.
Pseudo-residuals generalize classical residuals to arbitrary loss functions. They are defined as the negative gradient of the loss with respect to the current prediction:
$$\tilde{r}_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg|_{F=F_{m-1}}$$
We call these 'pseudo' residuals because they coincide with classical residuals only in the special case of squared loss; in general they are negative gradients of the loss, not raw prediction errors.
Pseudo-residuals tell us HOW to change each prediction to reduce the loss, not just HOW WRONG each prediction is. This subtle distinction is what allows gradient boosting to optimize any differentiable loss function using the same base learner training procedure.
| Property | Classical Residuals | Pseudo-Residuals |
|---|---|---|
| Definition | $r_i = y_i - F(x_i)$ | $\tilde{r}_i = -\frac{\partial L}{\partial F(x_i)}$ |
| Loss function | Implicitly squared loss | Adapts to any differentiable loss |
| For squared loss | $y_i - F(x_i)$ | $y_i - F(x_i)$ (identical) |
| For absolute loss | $y_i - F(x_i)$ | $\text{sign}(y_i - F(x_i))$ |
| Interpretation | Prediction error | Direction of steepest descent |
| Magnitude | Error magnitude | Gradient magnitude |
Let's rigorously derive the pseudo-residual formula and understand its connection to gradient descent in function space.
Consider the total loss over the training set:
$$\mathcal{L}(F) = \sum_{i=1}^{n} L(y_i, F(x_i))$$
We want to find the direction in function space that most reduces this loss. If we could perturb $F$ by a small function $\epsilon \cdot h$, the change in loss is:
$$\mathcal{L}(F + \epsilon h) - \mathcal{L}(F) \approx \epsilon \sum_{i=1}^{n} \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \cdot h(x_i)$$
This is a directional derivative. The steepest descent direction is the negative gradient:
$$-\nabla_F \mathcal{L} = \left( -\frac{\partial L(y_1, F(x_1))}{\partial F(x_1)}, \ldots, -\frac{\partial L(y_n, F(x_n))}{\partial F(x_n)} \right)$$
In standard parameter-space gradient descent, we compute $\nabla_\theta \mathcal{L}$ and update $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$.
In function-space gradient descent, the 'gradient' at the training points is the vector of partial derivatives. To update the function, we'd ideally subtract this gradient:
$$F^{(t+1)}(x_i) = F^{(t)}(x_i) - \eta \cdot \frac{\partial L(y_i, F^{(t)}(x_i))}{\partial F^{(t)}(x_i)}$$
But we need predictions at unseen points too!
This is where the base learner comes in. We fit a function $h_m$ to approximate the negative gradient vector, allowing generalization beyond the training set:
$$h_m \approx \left( -\frac{\partial L}{\partial F(x_1)}, \ldots, -\frac{\partial L}{\partial F(x_n)} \right)$$
The negative gradient tells us the optimal update direction at training points. But we need updates for test points too. By fitting a model (like a decision tree) to approximate the gradient, we learn a function that can extrapolate the gradient direction to unseen data. This is the key to generalization in gradient boosting.
Define the pseudo-residual for sample $i$ at iteration $m$:
$$\tilde{r}_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \bigg|_{F=F_{m-1}}$$
We train the base learner by minimizing:
$$h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} (\tilde{r}_{im} - h(x_i))^2$$
This is a standard squared-loss regression problem! Regardless of the original loss function, each boosting iteration solves a regression problem. This is the elegance of gradient boosting: one unified training procedure handles all loss functions.
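To see how this plays out in code, here is a minimal sketch of one boosting iteration, assuming NumPy and scikit-learn's `DecisionTreeRegressor` are available; real implementations also add a per-leaf line search and an initial model tailored to the loss, both omitted here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_iteration(X, y, current_pred, negative_gradient_fn, learning_rate=0.1):
    """One gradient boosting step: fit a tree to pseudo-residuals, update F.

    negative_gradient_fn(y, F) returns the pseudo-residuals for the chosen loss.
    """
    pseudo_residuals = negative_gradient_fn(y, current_pred)

    # Squared-loss regression on the pseudo-residuals, regardless of the original loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, pseudo_residuals)

    return current_pred + learning_rate * tree.predict(X), tree

# Example: the very same routine handles squared loss and absolute loss
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

F = np.full_like(y, y.mean())             # constant initial model
for _ in range(50):
    F, _ = boosting_iteration(X, y, F, lambda y, F: y - F)           # L2 pseudo-residuals
print("L2 training MSE:", np.mean((y - F) ** 2))

F = np.full_like(y, np.median(y))         # L1 typically starts at the median
for _ in range(50):
    F, _ = boosting_iteration(X, y, F, lambda y, F: np.sign(y - F))  # L1 pseudo-residuals
print("L1 training MAE:", np.mean(np.abs(y - F)))
```

Only the `negative_gradient_fn` changes between tasks; the tree-fitting step is identical.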
Imagine the training predictions $(F(x_1), \ldots, F(x_n))$ as a point in $n$-dimensional space. The pseudo-residual vector points from this current location toward lower loss. The base learner tries to output a function that, when added to $F$, moves predictions in this beneficial direction.
Each iteration takes a step in 'function space' toward the minimum—hence the name 'gradient descent in function space.'
Let's derive pseudo-residuals for the most commonly used loss functions in practice. Understanding these derivations builds intuition for how gradient boosting adapts to different tasks.
$$L(y, F) = \frac{1}{2}(y - F)^2$$
Gradient: $$\frac{\partial L}{\partial F} = \frac{\partial}{\partial F}\left[\frac{1}{2}(y - F)^2\right] = -(y - F) = F - y$$
Pseudo-residual (negative gradient): $$\tilde{r} = -(F - y) = y - F$$
Interpretation: For squared loss, pseudo-residuals equal classical residuals. The target for the next tree is simply the error: what the current model got wrong. This is why early gradient boosting papers talked about 'fitting to residuals.'
```python
import numpy as np

def pseudo_residuals_squared_loss(y_true, y_pred):
    """
    Pseudo-residuals for squared loss: r = y - F

    The model should predict higher where y > F (positive residual)
    and lower where y < F (negative residual).
    """
    return y_true - y_pred

# Example
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

residuals = pseudo_residuals_squared_loss(y_true, y_pred)
print(f"True values: {y_true}")
print(f"Predictions: {y_pred}")
print(f"Pseudo-residuals: {residuals}")
# Output: [0.5, -0.5, 1.0, -1.0]
```

$$L(y, F) = |y - F|$$
Gradient: The derivative of $|x|$ is $\text{sign}(x)$ (undefined at 0, but we handle this separately): $$\frac{\partial L}{\partial F} = \frac{\partial |y - F|}{\partial F} = -\text{sign}(y - F)$$
Pseudo-residual: $$\tilde{r} = \text{sign}(y - F)$$
Interpretation: Unlike squared loss, absolute loss pseudo-residuals ignore error magnitude—only the direction matters. The target is always $+1$ or $-1$, indicating whether the model should predict higher or lower. This makes gradient boosting with L1 loss robust to outliers: a large error produces the same gradient as a small one.
```python
import numpy as np

def pseudo_residuals_absolute_loss(y_true, y_pred):
    """
    Pseudo-residuals for absolute loss: r = sign(y - F)

    Robust to outliers: large and small errors produce
    the same magnitude gradient.
    """
    return np.sign(y_true - y_pred)

# Example with an outlier
y_true = np.array([3.0, 5.0, 7.0, 100.0])  # 100 is an outlier
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

residuals_l2 = y_true - y_pred  # Squared loss residuals
residuals_l1 = pseudo_residuals_absolute_loss(y_true, y_pred)

print(f"L2 pseudo-residuals: {residuals_l2}")
# Output: [0.5, -0.5, 1.0, 90.0] <-- Outlier dominates!

print(f"L1 pseudo-residuals: {residuals_l1}")
# Output: [1.0, -1.0, 1.0, 1.0] <-- Outlier has same influence
```

$$L_\delta(y, F) = \begin{cases} \frac{1}{2}(y - F)^2 & \text{if } |y - F| \leq \delta \\ \delta|y - F| - \frac{\delta^2}{2} & \text{otherwise} \end{cases}$$
Gradient: $$\frac{\partial L_\delta}{\partial F} = \begin{cases} F - y & \text{if } |y - F| \leq \delta \\ -\delta \cdot \text{sign}(y - F) & \text{otherwise} \end{cases}$$
Pseudo-residual: $$\tilde{r} = \begin{cases} y - F & \text{if } |y - F| \leq \delta \\ \delta \cdot \text{sign}(y - F) & \text{otherwise} \end{cases}$$
Interpretation: Huber loss blends L2 (for small errors) and L1 (for large errors). The pseudo-residual equals the error for small residuals but is capped at $\pm\delta$ for outliers. This provides a smooth transition between sensitivity (for fitting) and robustness (for outliers).
The threshold δ defines what counts as an 'outlier.' A common heuristic is to set δ based on the residual distribution—e.g., the 90th percentile of |y - F| so that 10% of samples are treated robustly. Some implementations tune δ adaptively during training.
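Following the adaptive-δ heuristic just described, here is a minimal sketch of Huber pseudo-residuals; the function name and the `quantile` parameter are illustrative, not taken from a specific library.

```python
import numpy as np

def pseudo_residuals_huber(y_true, y_pred, delta=None, quantile=0.9):
    """
    Pseudo-residuals for Huber loss.

    If delta is None, it is set adaptively to the `quantile` of |y - F|,
    so the largest (1 - quantile) fraction of errors is treated robustly.
    """
    error = y_true - y_pred
    if delta is None:
        delta = np.quantile(np.abs(error), quantile)
    # y - F inside the quadratic zone, capped at +/- delta outside it
    return np.where(np.abs(error) <= delta, error, delta * np.sign(error))

# Example with one outlier
y_true = np.array([3.0, 5.0, 7.0, 100.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])
print(pseudo_residuals_huber(y_true, y_pred, delta=2.0))
# Output: [ 0.5 -0.5  1.  2. ]  <-- the outlier's residual of 90 is capped at delta
```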
For binary classification with labels $y \in \{0, 1\}$ and predictions as log-odds $F$ (so $P(y=1) = \sigma(F) = \frac{1}{1+e^{-F}}$):
$$L(y, F) = -y \cdot F + \log(1 + e^F) = -y \log(\sigma(F)) - (1-y)\log(1-\sigma(F))$$
Gradient: $$\frac{\partial L}{\partial F} = -y + \frac{e^F}{1+e^F} = -y + \sigma(F) = \sigma(F) - y$$
Pseudo-residual: $$\tilde{r} = y - \sigma(F) = y - P(y=1|x)$$
Interpretation: The pseudo-residual is the difference between the true label and the predicted probability. If $y=1$ but $\sigma(F)=0.2$, then $\tilde{r}=0.8$: the model should increase its prediction substantially. If $y=0$ but $\sigma(F)=0.9$, then $\tilde{r}=-0.9$: a strong signal to decrease the prediction.
This is exactly the probability residual—the analog of error in probability space. The next tree fits to these probability residuals, learning to increase or decrease predicted log-odds.
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid function."""
    return np.where(
        x >= 0,
        1 / (1 + np.exp(-x)),
        np.exp(x) / (1 + np.exp(x))
    )

def pseudo_residuals_log_loss(y_true, raw_predictions):
    """
    Pseudo-residuals for log loss (binary classification).

    Parameters:
    -----------
    y_true : array of {0, 1}
        True binary labels
    raw_predictions : array
        Current model predictions in log-odds (before sigmoid)

    Returns:
    --------
    residuals : array
        y - sigmoid(F) for each sample
    """
    predicted_proba = sigmoid(raw_predictions)
    return y_true - predicted_proba

# Example: Confident correct vs. confident wrong
y_true = np.array([1, 1, 0, 0])
raw_predictions = np.array([2.0, -1.0, -2.0, 1.0])  # Log-odds

predicted_proba = sigmoid(raw_predictions)
residuals = pseudo_residuals_log_loss(y_true, raw_predictions)

print("True labels:      ", y_true)
print("Log-odds (F):     ", raw_predictions)
print("Probabilities:    ", np.round(predicted_proba, 3))
print("Pseudo-residuals: ", np.round(residuals, 3))

# Output:
# True labels:       [1 1 0 0]
# Log-odds (F):      [ 2. -1. -2.  1.]
# Probabilities:     [0.881 0.269 0.119 0.731]
# Pseudo-residuals:  [ 0.119  0.731 -0.119 -0.731]
#                      ^ small (correct, confident)
#                             ^ large (wrong, model needs to change!)
```

For ranking tasks, we often use pairwise losses that compare items. Given a pair $(i, j)$ where item $i$ should rank higher than $j$:
$$L_{ij}(F) = \log(1 + e^{-(F(x_i) - F(x_j))})$$
Gradient with respect to $F(x_i)$: $$\frac{\partial L_{ij}}{\partial F(x_i)} = -\sigma(F(x_j) - F(x_i)) = \frac{-1}{1 + e^{F(x_i) - F(x_j)}}$$
Pseudo-residual: $$\tilde{r}_i = \sum_{j: i \succ j} \sigma(F(x_j) - F(x_i)) - \sum_{k: k \succ i} \sigma(F(x_i) - F(x_k))$$
Interpretation: The pseudo-residual for a document aggregates its pairwise preferences. If the document should rank high but current scores put it low, it receives a positive gradient (push its score up). The magnitude depends on how badly the ranking is violated.
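As an illustrative sketch (not the exact LambdaRank/LambdaMART implementation), assuming preferences are supplied as index pairs $(i, j)$ meaning item $i$ should rank above item $j$, the aggregation above could be computed like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_pseudo_residuals(scores, preference_pairs):
    """
    Pseudo-residuals for a pairwise logistic ranking loss.

    preference_pairs: list of (i, j) index pairs meaning item i should
    rank above item j. Each pair pushes score[i] up and score[j] down
    by sigma(F_j - F_i), matching the aggregated formula above.
    """
    residuals = np.zeros_like(scores)
    for i, j in preference_pairs:
        push = sigmoid(scores[j] - scores[i])  # large when the pair is badly ordered
        residuals[i] += push
        residuals[j] -= push
    return residuals

# Three documents; doc 0 should outrank doc 1, and doc 1 should outrank doc 2
scores = np.array([0.2, 1.5, -0.3])   # current model scores (doc 0 is misranked)
pairs = [(0, 1), (1, 2)]
print(pairwise_pseudo_residuals(scores, pairs).round(3))
# Doc 0 gets a large positive residual (push its score up past doc 1);
# doc 1 is pushed down more than it is pushed up.
```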
Having examined several specific cases, let's synthesize a unified understanding of pseudo-residuals and their role in gradient boosting.
| Loss Function | Task | Pseudo-Residual $\tilde{r}$ | Intuition |
|---|---|---|---|
| Squared (L2) | Regression | $y - F$ | Predict more where error is positive |
| Absolute (L1) | Robust regression | $\text{sign}(y - F)$ | Direction only, outlier-robust |
| Huber | Robust regression | Capped residual | L2 for small errors, L1 for large |
| Log loss | Binary classification | $y - \sigma(F)$ | Probability error |
| Multinomial | Multi-class | $y_k - p_k$ per class | Class probability error |
| Pairwise | Ranking | Aggregated pair comparisons | Push up/down based on rank violations |
| Quantile | Quantile regression | Asymmetric sign | Above/below with asymmetric weight |
Pseudo-residuals abstract away the specifics of each loss function. The gradient boosting framework doesn't need to know whether it's doing regression, classification, or ranking—it just fits trees to pseudo-residuals. This abstraction enables code reuse and makes gradient boosting frameworks (like XGBoost) support diverse tasks with minor changes.
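The table also lists quantile loss, which we have not derived. As a brief illustration of how the same pattern extends, here is a hedged sketch of pseudo-residuals for the pinball (quantile) loss at level $\tau$; the function name is illustrative.

```python
import numpy as np

def pseudo_residuals_quantile(y_true, y_pred, tau=0.9):
    """
    Pseudo-residuals for the pinball (quantile) loss at level tau.

    Negative gradient: tau where y > F, tau - 1 where y < F
    (a subgradient convention handles the tie y == F).
    For tau = 0.9, under-predictions are pushed up nine times harder
    than over-predictions are pushed down, so the model converges
    toward the 90th percentile rather than the mean.
    """
    return np.where(y_true > y_pred, tau, tau - 1.0)

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])
print(pseudo_residuals_quantile(y_true, y_pred, tau=0.9))
# Output: [ 0.9 -0.1  0.9 -0.1]
```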
Visualizing pseudo-residuals geometrically deepens understanding. Let's think about what happens in the space of predictions.
For $n$ training samples, the current model produces an $n$-dimensional prediction vector:
$$\mathbf{F} = (F(x_1), \ldots, F(x_n)) \in \mathbb{R}^n$$
The loss function defines a scalar surface over this space:
$$\mathcal{L}(\mathbf{F}) = \sum_{i=1}^{n} L(y_i, F_i)$$
Each training iteration: We stand at point $\mathbf{F}^{(m-1)}$, compute the gradient $\nabla \mathcal{L}$, and try to move toward lower loss.
The pseudo-residual vector
$$\tilde{\mathbf{r}} = -\nabla \mathcal{L} = \left( -\frac{\partial L}{\partial F_1}, \ldots, -\frac{\partial L}{\partial F_n} \right)$$
points in the direction of steepest descent. If we could update predictions freely:
$$\mathbf{F}^{(m)} = \mathbf{F}^{(m-1)} + \eta \tilde{\mathbf{r}}$$
we'd take a pure gradient descent step.
But we can't update predictions arbitrarily—we need to generalize to unseen data. The base learner constrains us to updates that can be expressed as $h(x)$:
$$\mathbf{F}^{(m)} = \mathbf{F}^{(m-1)} + \eta \cdot (h_m(x_1), \ldots, h_m(x_n))$$
The base learner finds the best approximation to the gradient direction within its function class.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_pseudo_residual_geometry():
    """
    Visualize pseudo-residuals as gradient directions in prediction space.

    For 2 training samples, we can plot the full loss surface and gradient.
    """
    # Two training samples with targets 1.0 and 2.0
    y = np.array([1.0, 2.0])

    # Create a grid of possible predictions
    F1 = np.linspace(-1, 4, 100)
    F2 = np.linspace(-1, 4, 100)
    F1_grid, F2_grid = np.meshgrid(F1, F2)

    # Compute squared loss surface
    loss_surface = 0.5 * ((F1_grid - y[0])**2 + (F2_grid - y[1])**2)

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Left: Loss contours with gradient vector
    ax = axes[0]
    contour = ax.contour(F1_grid, F2_grid, loss_surface, levels=20, cmap='viridis')
    ax.clabel(contour, inline=True, fontsize=8)

    # Current predictions and gradient
    F_current = np.array([0.5, 1.0])      # Current predictions
    pseudo_residuals = y - F_current      # Gradient for squared loss

    ax.scatter(*F_current, s=100, c='red', zorder=5, label='Current F')
    ax.scatter(*y, s=100, c='green', marker='*', zorder=5, label='Target y')
    ax.arrow(F_current[0], F_current[1],
             0.5*pseudo_residuals[0], 0.5*pseudo_residuals[1],
             head_width=0.1, head_length=0.05, fc='red', ec='red',
             label='Pseudo-residual direction')

    ax.set_xlabel('F(x₁)')
    ax.set_ylabel('F(x₂)')
    ax.set_title('Squared Loss Surface with Gradient')
    ax.legend()
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

    # Right: Iteration trajectory
    ax = axes[1]
    contour = ax.contour(F1_grid, F2_grid, loss_surface, levels=20,
                         cmap='viridis', alpha=0.5)

    # Simulate gradient descent
    F_trajectory = [F_current.copy()]
    learning_rate = 0.3
    for _ in range(10):
        pseudo_res = y - F_trajectory[-1]
        F_new = F_trajectory[-1] + learning_rate * pseudo_res
        F_trajectory.append(F_new)

    F_trajectory = np.array(F_trajectory)
    ax.plot(F_trajectory[:, 0], F_trajectory[:, 1], 'ro-', markersize=8, linewidth=2)
    ax.scatter(*y, s=150, c='green', marker='*', zorder=5, label='Target (optimum)')

    ax.set_xlabel('F(x₁)')
    ax.set_ylabel('F(x₂)')
    ax.set_title('Gradient Descent Trajectory')
    ax.legend()
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('pseudo_residual_geometry.png', dpi=150)
    plt.show()

visualize_pseudo_residual_geometry()
```

The key insight is that fitting a base learner to pseudo-residuals is equivalent to projecting the gradient onto the subspace of functions representable by the base learner.
In linear algebra terms, the squared-loss fit in each iteration approximates an orthogonal projection of the pseudo-residual vector onto the set of prediction vectors the base learner class can realize. Decision trees project onto the subspace of piecewise-constant functions; deeper trees span a larger subspace, allowing better gradient approximation but risking overfitting.
Deep trees can approximate the gradient perfectly on training data. But perfect gradient approximation at each step leads to overfitting—the model memorizes noise. Shallow trees force a coarse approximation, acting as implicit regularization. Each step makes progress but doesn't perfectly fit the gradient, leaving room for subsequent iterations.
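As a quick illustration of this trade-off (a sketch assuming scikit-learn is available), fitting trees of increasing depth to the same pseudo-residual vector shows how depth controls the fidelity of the gradient approximation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

F = np.full_like(y, y.mean())    # a constant initial model
pseudo_residuals = y - F         # squared-loss pseudo-residuals

for depth in (1, 2, 4, 8, None):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, pseudo_residuals)
    r2 = tree.score(X, pseudo_residuals)   # training R^2 of the gradient fit
    print(f"max_depth={depth}: training R^2 = {r2:.3f}")

# Deeper trees approximate the gradient (noise included) ever more closely;
# shallow trees give a coarse approximation that acts as implicit regularization.
```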
Computing pseudo-residuals correctly is crucial for gradient boosting implementations. Let's examine practical issues that arise in real-world settings.
For some loss functions, naive gradient computation can cause numerical issues:
Log loss with extreme predictions: If $F$ is very large or very negative, $\sigma(F)$ approaches 1 or 0, and $\log(\sigma(F))$ or $\log(1-\sigma(F))$ can overflow.
Solution: Use numerically stable formulations:
```python
import numpy as np

def stable_sigmoid(x):
    """
    Numerically stable sigmoid that avoids overflow.
    """
    # For x >= 0: 1 / (1 + exp(-x)) is stable
    # For x < 0:  exp(x) / (1 + exp(x)) is stable
    return np.where(
        x >= 0,
        1.0 / (1.0 + np.exp(-x)),
        np.exp(x) / (1.0 + np.exp(x))
    )

def stable_log_loss_pseudo_residuals(y_true, raw_predictions, clip_value=1e-15):
    """
    Compute pseudo-residuals for log loss with numerical safeguards.

    Parameters:
    -----------
    y_true : array of {0, 1}
    raw_predictions : array of raw scores (log-odds)
    clip_value : float
        Minimum probability to prevent log(0)
    """
    # Stable sigmoid computation
    proba = stable_sigmoid(raw_predictions)

    # Clip probabilities to prevent extreme gradients
    proba = np.clip(proba, clip_value, 1.0 - clip_value)

    # Pseudo-residual: y - p
    return y_true - proba

# Test with extreme values
raw_preds = np.array([-100, -10, 0, 10, 100])
proba = stable_sigmoid(raw_preds)
print(f"Raw predictions: {raw_preds}")
print(f"Stable sigmoid:  {proba}")
# Output: [0., 0.0000454, 0.5, 0.9999546, 1.]
# Note: Doesn't overflow even for extreme values
```

Zero gradients: Some samples may have zero pseudo-residuals (perfect predictions). This is fine—these samples simply don't contribute to the next tree's training.
Undefined gradients: At discontinuities (e.g., $|y-F|$ at $y=F$), use subgradients or smooth approximations.
Multi-output problems: For multi-class classification, compute pseudo-residuals for each class separately:
$$\tilde{r}_{ik} = y_{ik} - p_k(x_i)$$
where $p_k$ is the predicted probability for class $k$.
Modern implementations like XGBoost use second-order Taylor expansions, computing both the gradient (first derivative) and Hessian (second derivative). The Hessian provides curvature information that improves split finding and enables optimization-aware tree construction. We'll cover this in detail when discussing XGBoost.
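For log loss, for instance, the per-sample gradient is $\sigma(F) - y$ and the Hessian is $\sigma(F)(1 - \sigma(F))$. Below is a minimal sketch of these two quantities, the kind of values a custom objective would supply; it is not numerically stabilized, and the exact library API is left to the XGBoost discussion.

```python
import numpy as np

def log_loss_grad_hess(y_true, raw_predictions):
    """
    First- and second-order derivatives of log loss w.r.t. raw scores F.

    grad = sigma(F) - y      (the pseudo-residual is -grad = y - sigma(F))
    hess = sigma(F) * (1 - sigma(F))   (curvature; largest near p = 0.5)
    """
    p = 1.0 / (1.0 + np.exp(-raw_predictions))
    grad = p - y_true
    hess = p * (1.0 - p)
    return grad, hess

y = np.array([1, 1, 0, 0])
F = np.array([2.0, -1.0, -2.0, 1.0])
grad, hess = log_loss_grad_hess(y, F)
print("grad:", grad.round(3))   # negative of the pseudo-residuals shown earlier
print("hess:", hess.round(3))   # confident predictions have low curvature
```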
Multi-class classification requires special treatment because we predict multiple outputs (one per class). Let's derive pseudo-residuals for the multinomial loss with $K$ classes.
For a sample $x$ with raw predictions $F_1(x), \ldots, F_K(x)$, the class probabilities are:
$$p_k(x) = \frac{e^{F_k(x)}}{\sum_{j=1}^{K} e^{F_j(x)}}$$
With one-hot encoded labels $y_{ik} \in \{0, 1\}$ indicating whether sample $i$ belongs to class $k$:
$$L(y, F) = -\sum_{k=1}^{K} y_k \log p_k$$
$$\frac{\partial L}{\partial F_k} = \frac{\partial}{\partial F_k}\left[ -\sum_{j} y_j \log p_j \right]$$
Using the softmax derivative properties:
$$\frac{\partial \log p_k}{\partial F_k} = 1 - p_k, \quad \frac{\partial \log p_j}{\partial F_k} = -p_k \text{ for } j \neq k$$
We get:
$$\frac{\partial L}{\partial F_k} = -y_k(1 - p_k) + \sum_{j \neq k} y_j \cdot p_k = p_k - y_k$$
Pseudo-residual for class $k$: $$\tilde{r}_k = y_k - p_k$$
```python
import numpy as np

def stable_softmax(logits):
    """
    Numerically stable softmax.

    Subtracts max for stability without changing the result
    (since softmax is shift-invariant).
    """
    logits_stable = logits - np.max(logits, axis=-1, keepdims=True)
    exp_logits = np.exp(logits_stable)
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

def multiclass_pseudo_residuals(y_true_onehot, raw_predictions):
    """
    Compute pseudo-residuals for multi-class classification.

    Parameters:
    -----------
    y_true_onehot : array of shape (n_samples, n_classes)
        One-hot encoded true labels
    raw_predictions : array of shape (n_samples, n_classes)
        Raw model outputs (logits) for each class

    Returns:
    --------
    residuals : array of shape (n_samples, n_classes)
        Pseudo-residuals for each sample and class
    """
    probabilities = stable_softmax(raw_predictions)
    return y_true_onehot - probabilities

# Example: 3-class classification
n_samples = 5
n_classes = 3

# True labels (one-hot)
y_true = np.array([0, 1, 2, 0, 1])  # Class indices
y_onehot = np.eye(n_classes)[y_true]

# Current model predictions (logits)
np.random.seed(42)
logits = np.random.randn(n_samples, n_classes)

# Compute pseudo-residuals
proba = stable_softmax(logits)
residuals = multiclass_pseudo_residuals(y_onehot, logits)

print("Sample 0 (true class = 0):")
print(f"  Probabilities: {proba[0].round(3)}")
print(f"  One-hot:       {y_onehot[0]}")
print(f"  Residuals:     {residuals[0].round(3)}")
print(f"  Interpretation: Push class 0 by +{residuals[0, 0]:.3f}, "
      f"class 1 by {residuals[0, 1]:.3f}, class 2 by {residuals[0, 2]:.3f}")

# Note: Each sample produces K pseudo-residuals (one per class)
# Gradient boosting fits K trees per iteration (one per class)
```

Multi-tree per iteration: Fit one tree to pseudo-residuals for each class. This is the standard approach in scikit-learn's GradientBoostingClassifier.
$$h_m^{(k)}(x) \text{ fitted to } \{(x_i, \tilde{r}_{ik})\}_{i=1}^n \text{ for each class } k$$
Single-tree with multi-output leaves: XGBoost and LightGBM build one tree per iteration where each leaf contains $K$ output values. This is more efficient but requires careful implementation.
For K-class classification with M iterations, the standard approach requires K×M trees total. With 100 classes and 500 iterations, that's 50,000 trees! Modern implementations use tricks like single-tree multi-output to reduce this overhead while maintaining performance.
We have thoroughly examined pseudo-residuals—the fundamental learning signal that makes gradient boosting a universal framework. The key takeaways: pseudo-residuals are the negative gradients of the loss with respect to the current predictions; they reduce to classical residuals under squared loss but adapt to any differentiable loss; and fitting a base learner to them turns every boosting iteration into a plain squared-loss regression problem, which is why one training procedure serves regression, classification, and ranking alike.
With pseudo-residuals understood, we next examine base learner fitting—how decision trees are trained to approximate the gradient direction and why their properties (depth, splitting criteria, regularization) profoundly impact gradient boosting's success. We'll see how trees balance gradient approximation accuracy with generalization.
You now understand pseudo-residuals at a fundamental level—how they're computed, what they represent, and why they enable gradient boosting to handle any differentiable loss function. This understanding is essential for debugging gradient boosting models and designing custom loss functions.