Gradient boosting has dominated machine learning competitions and production systems for over a decade. Yet beneath its remarkable success lies a fundamental statistical flaw that went largely unrecognized until CatBoost's developers at Yandex formally characterized and solved it.
This flaw is called prediction shift, sometimes referred to as target leakage in the boosting literature. It occurs because traditional gradient boosting uses the same data points to both compute residuals and train the next tree—creating a subtle but systematic bias that degrades generalization performance.
Understanding prediction shift requires us to think carefully about what gradient boosting is actually doing: it's fitting models to negative gradients (which, for squared-error loss, are simply the residuals) computed from predictions made by the current ensemble. When those predictions are made on the same data used to train them, information leaks from the targets into the training process in ways that don't generalize.
Prediction shift causes gradient boosting to learn patterns that are artifacts of the training process rather than genuine signal in the data. This leads to overfitting that cannot be detected through standard cross-validation, making models appear better during training than they actually perform on new data.
To understand prediction shift, we must examine the mechanics of gradient boosting at a granular level. Consider the standard gradient boosting procedure:

1. Initialize the ensemble with a constant prediction $F_0(x)$.
2. At iteration $t$, compute residuals (negative gradients) $r_{ti} = y_i - F_{t-1}(x_i)$ for every training sample $(x_i, y_i)$.
3. Fit a new tree $h_t$ to the pairs $(x_i, r_{ti})$.
4. Update the ensemble: $F_t(x) = F_{t-1}(x) + \eta \, h_t(x)$, where $\eta$ is the learning rate.
5. Repeat until $T$ trees have been added.
The critical observation is in step 2: we compute residuals $r_{ti}$ using predictions $F_{t-1}(x_i)$ on the same data points $(x_i, y_i)$ that were used to train $F_{t-1}$.
This creates what statisticians call conditioning on the training set—the residuals we're fitting are not independent samples from any meaningful distribution, but rather quantities that have been optimized on the very data we're using to compute them.
When we compute the residual for sample $i$ using $F_{t-1}(x_i)$, that prediction was influenced by the target $y_i$ itself (since $y_i$ was used in training). This means the residual $r_{ti}$ contains information about $y_i$ beyond what the features $x_i$ can legitimately provide.
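To make these mechanics concrete, here is a minimal sketch (assuming squared-error loss and scikit-learn regression trees, not CatBoost's implementation) of the plain boosting loop; the residual line is exactly where a model that was trained on $(x_i, y_i)$ is reused to score that same sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def plain_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal squared-loss gradient boosting; shows where target leakage enters."""
    F = np.full(len(y), y.mean())   # step 1: constant initial model
    trees = []
    for _ in range(n_trees):
        # Step 2: residuals use predictions ON THE TRAINING SET.
        # F was fit using y, so each residual already "knows" its own target.
        residuals = y - F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # step 3
        F += learning_rate * tree.predict(X)                                 # step 4
        trees.append(tree)
    return y.mean(), trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)

# Example usage on noisy synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * rng.normal(size=500)
base, trees = plain_gradient_boosting(X, y)
print("train RMSE:", np.sqrt(np.mean((y - predict(base, trees, X)) ** 2)))
```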
A Concrete Example of Prediction Shift
Imagine we have a dataset where the true relationship is $y = f(x) + \epsilon$, with $\epsilon$ being pure noise. An ideal learner would capture $f(x)$ and leave the noise alone.
In traditional gradient boosting:

- Early trees capture $f(x)$, but they also partially fit the noise $\epsilon$ at individual training points.
- The residuals for those points are then computed from predictions that have already absorbed some of their noise, so they look smaller than the errors the model would make on fresh data.
- Later trees are fit to these biased residuals, so the effect compounds with every iteration.
The result: the ensemble becomes increasingly calibrated to artifacts of how it fit the training noise, rather than the underlying signal.
| Iteration | What Should Happen | What Actually Happens | Consequence |
|---|---|---|---|
| t = 1 | Fit signal from data | Fit signal + some training noise | Residuals are biased downward for overfit samples |
| t = 2 | Fit remaining signal | Fit biased residuals from t=1 | Bias compounds; noise in t=1 affects t=2's targets |
| t = 3...T | Progressively refine | Progressively fit training artifacts | Model increasingly specialized to training set |
| Final | Strong generalization | Apparent low training error | Hidden overfitting; poor test performance |
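The trajectory in this table can be observed on synthetic data. The sketch below (illustrative; the dataset and hyperparameters are arbitrary choices) uses scikit-learn's `staged_predict` to track training and test RMSE after each boosting stage: training error keeps sinking below the noise level while test error flattens out.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

def make_data(n):
    X = rng.normal(size=(n, 5))
    return X, np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.5 * rng.normal(size=n)

X_train, y_train = make_data(500)
X_test, y_test = make_data(5000)

gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                               max_depth=4, random_state=0)
gb.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
for t, (p_tr, p_te) in enumerate(zip(gb.staged_predict(X_train),
                                     gb.staged_predict(X_test)), start=1):
    if t % 50 == 0:
        rmse_tr = np.sqrt(np.mean((y_train - p_tr) ** 2))
        rmse_te = np.sqrt(np.mean((y_test - p_te) ** 2))
        print(f"trees={t:3d}  train RMSE={rmse_tr:.3f}  test RMSE={rmse_te:.3f}")
# Train RMSE drops well below the noise level (0.5); test RMSE cannot.
```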
Let's formalize the prediction shift problem mathematically. Consider a dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ drawn i.i.d. from distribution $P(X, Y)$.
For a gradient boosting model $F(x)$ trained on $\mathcal{D}$, define:
Conditional expectation of the target given the features (the function an ideal model would recover): $$\mu(x) = \mathbb{E}[Y | X = x]$$
Training prediction for sample $i$: $$\hat{y}_i = F(x_i; \mathcal{D})$$
where $F(x_i; \mathcal{D})$ explicitly denotes that the model was trained on $\mathcal{D}$, which includes $(x_i, y_i)$.
The residual used for gradient computation: $$r_i = y_i - F(x_i; \mathcal{D})$$
Now here's the critical insight. The expected residual conditional on the training set is: $$\mathbb{E}[r_i | \mathcal{D}] = \mathbb{E}[y_i | \mathcal{D}] - F(x_i; \mathcal{D})$$
But because $y_i$ is part of $\mathcal{D}$: $$\mathbb{E}[y_i | \mathcal{D}] = y_i$$
Therefore: $$\mathbb{E}[r_i | \mathcal{D}] = y_i - F(x_i; \mathcal{D})$$
This is not the same as the true expected residual we'd want to fit: $$\mathbb{E}[y - F(x) | X = x_i] = \mu(x_i) - \mathbb{E}[F(x_i)]$$
The residuals we compute are conditioned on the specific targets $y_i$, not just the features $x_i$. This creates a systematic difference between the distribution of training residuals and the distribution of residuals we'd see on new data with the same features but different target noise.
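A minimal numerical illustration, using the simplest possible fitted model, a constant equal to the sample mean: even here the in-sample residual is systematically smaller than the residual against a mean that excludes the point, because the fitted value already contains a $1/n$ share of that point's own noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 50, 1.0, 20000

in_sample_sq, held_out_sq = 0.0, 0.0
for _ in range(trials):
    y = rng.normal(0.0, sigma, size=n)   # pure noise around a true value of 0
    i = 0                                # inspect one fixed sample
    full_mean = y.mean()                 # "model" trained on all n samples
    loo_mean = y[1:].mean()              # "model" that never saw y[i]
    in_sample_sq += (y[i] - full_mean) ** 2
    held_out_sq += (y[i] - loo_mean) ** 2

# Theory: E[(y_i - full_mean)^2] = sigma^2 * (1 - 1/n)
#         E[(y_i - loo_mean)^2]  = sigma^2 * (1 + 1/(n-1))
print("in-sample mean squared residual:", in_sample_sq / trials)   # ~0.98
print("held-out mean squared residual :", held_out_sq / trials)    # ~1.02
```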
Quantifying the Shift
The prediction shift can be precisely quantified. For a model trained on $\mathcal{D}$ and applied to a new point $(x, y)$ from the same distribution:
$$\text{Shift}(x) = \mathbb{E}_{\mathcal{D}'}\!\left[F(x; \mathcal{D}')\right] - \mathbb{E}_{\mathcal{D}}\!\left[F(x; \mathcal{D}) \mid (x, y) \in \mathcal{D}\right]$$
where $\mathcal{D}'$ is an independent training set of the same size.
This shift measures how different our predictions are when $x$ is in the training set versus when it's held out. For an unbiased learner, this should be zero—but gradient boosting systematically produces non-zero shift due to the target leakage in residual computation.
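The shift can be estimated by Monte Carlo in the spirit of this definition: fix a query point, repeatedly train one model on data that includes it and one on data that does not, and compare the average predictions at that point. The sketch below (an illustration with arbitrary synthetic data, not CatBoost code) makes the effect visible by giving the fixed point a large positive noise term.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)

def true_f(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

# A fixed query point whose observed target carries large positive noise
x0 = np.zeros((1, 5))
y0 = true_f(x0)[0] + 1.5

preds_with, preds_without = [], []
for _ in range(100):
    X = rng.normal(size=(200, 5))
    y = true_f(X) + 0.5 * rng.normal(size=200)

    gb_in = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
    gb_in.fit(np.vstack([X, x0]), np.append(y, y0))    # (x0, y0) in the training set
    gb_out = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
    gb_out.fit(X, y)                                   # (x0, y0) held out

    preds_with.append(gb_in.predict(x0)[0])
    preds_without.append(gb_out.predict(x0)[0])

print("true f(x0)               :", true_f(x0)[0])
print("mean prediction, x0 in D :", np.mean(preds_with))    # pulled toward y0 = 1.5
print("mean prediction, held out:", np.mean(preds_without))
print("estimated Shift(x0)      :", np.mean(preds_without) - np.mean(preds_with))
```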
The magnitude of prediction shift typically:

- grows as the dataset shrinks, because each sample exerts more influence on the model that scores it;
- grows with model complexity (deeper trees, more iterations), which gives the ensemble more capacity to memorize training noise;
- grows with the noise-to-signal ratio, since the leaked information is precisely the noise component of each target.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt


def demonstrate_prediction_shift(n_samples=1000, n_iterations=100):
    """
    Demonstrate prediction shift by comparing in-sample vs out-of-sample residuals.

    Prediction shift manifests as systematically smaller in-sample residuals
    compared to out-of-sample residuals for the same underlying data distribution.
    """
    np.random.seed(42)

    # Generate data with known noise structure
    X = np.random.randn(n_samples, 5)
    true_signal = np.sin(X[:, 0]) + 0.5 * X[:, 1]   # True relationship
    noise = np.random.randn(n_samples) * 0.5        # Pure noise
    y = true_signal + noise

    # Train gradient boosting model
    gb = GradientBoostingRegressor(
        n_estimators=n_iterations,
        learning_rate=0.1,
        max_depth=4,
        random_state=42
    )
    gb.fit(X, y)

    # In-sample predictions (same data used for training)
    in_sample_preds = gb.predict(X)
    in_sample_residuals = y - in_sample_preds

    # Out-of-sample predictions via cross-validation
    # Each prediction is made by a model that never saw that sample
    cv_preds = cross_val_predict(
        GradientBoostingRegressor(
            n_estimators=n_iterations,
            learning_rate=0.1,
            max_depth=4,
            random_state=42
        ),
        X, y, cv=5
    )
    cv_residuals = y - cv_preds

    # Compute statistics demonstrating prediction shift
    in_sample_rmse = np.sqrt(np.mean(in_sample_residuals**2))
    cv_rmse = np.sqrt(np.mean(cv_residuals**2))
    shift_ratio = cv_rmse / in_sample_rmse

    print(f"In-sample RMSE: {in_sample_rmse:.4f}")
    print(f"Cross-validated RMSE: {cv_rmse:.4f}")
    print(f"Prediction shift ratio: {shift_ratio:.2f}x")
    print(f"\nThis {shift_ratio:.2f}x difference is prediction shift in action.")
    print("Traditional GB makes residuals appear smaller than they truly are.")

    return in_sample_residuals, cv_residuals, shift_ratio


# Run demonstration
in_sample, cv_residuals, ratio = demonstrate_prediction_shift()
```

CatBoost introduces ordered boosting, a principled solution to prediction shift that maintains the efficiency of gradient boosting while eliminating its statistical bias. The key insight is deceptively simple: when computing residuals for a sample, use predictions from a model that has never seen that sample.
The Ordered Boosting Algorithm

Ordered boosting reorganizes residual computation around a random permutation $\sigma$ of the training samples:

1. Draw a random permutation $\sigma$ of the $n$ training samples.
2. For each sample $i$, maintain a model $F^{<i}$ trained only on the samples that precede $i$ in the permutation order.
3. Compute the residual for sample $i$ as $r_i = y_i - F^{<i}(x_i)$, using a model that has never seen $(x_i, y_i)$.
4. Fit the next tree to these residuals and add it to the ensemble.
The Crucial Difference
In traditional gradient boosting, we compute: $$r_i = y_i - F_{t-1}(x_i), \qquad F_{t-1} \text{ trained on all of } \mathcal{D}, \text{ including } (x_i, y_i)$$
In ordered boosting, we compute: $$r_i = y_i - F^{<i}_{t-1}(x_i), \qquad F^{<i}_{t-1} \text{ trained only on samples preceding } i \text{ in the permutation } \sigma$$
By ensuring the model used to predict sample $i$ has never seen $y_i$, we break the feedback loop that causes prediction shift. The residual $r_i$ now reflects genuine error—not training set memorization bias. Each residual is an honest signal about what the model still needs to learn.
The Permutation Ensemble
Using a single permutation would introduce its own bias—samples early in the permutation would have residuals computed from weak models, while later samples would have residuals from strong models. CatBoost addresses this through multiple permutation models:
During training, CatBoost maintains $s$ different permutations $\{\sigma_1, \dots, \sigma_s\}$ (typically $s = 4$). For each tree being added to the ensemble:

- one permutation $\sigma_r$ is sampled at random;
- ordered residuals are computed with respect to $\sigma_r$, so every sample's gradient still comes from a model that has not seen its target;
- the tree's structure and leaf values are then derived from those residuals.
This ensemble of permutations smooths out the variance introduced by any single ordering while maintaining the unbiased residual computation that eliminates prediction shift.
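A toy illustration of this smoothing effect: below, the ordered "model" is just a prefix mean (a stand-in for the real incremental tree ensemble described later), and averaging its predictions over several random permutations visibly reduces the variance that any single ordering introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 4
y = rng.normal(size=n)   # targets: pure noise around a true value of 0

def ordered_prefix_predictions(y, sigma):
    """Toy ordered 'model': each sample is predicted by the mean of the
    targets that precede it in the permutation (it never sees its own y)."""
    preds = np.zeros_like(y)
    running, count = 0.0, 0
    for i in sigma:
        preds[i] = running / count if count else 0.0
        running += y[i]
        count += 1
    return preds

single = ordered_prefix_predictions(y, rng.permutation(n))
averaged = np.mean(
    [ordered_prefix_predictions(y, rng.permutation(n)) for _ in range(s)], axis=0
)
# The true value is 0 everywhere, so prediction variance is prediction error
print("prediction variance, one permutation  :", single.var())
print("prediction variance, averaged over s=4:", averaged.var())
```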
| Aspect | Traditional Gradient Boosting | CatBoost Ordered Boosting |
|---|---|---|
| Residual computation | Uses model trained on all data | Uses model trained on preceding samples only |
| Information flow | Target $y_i$ influences its own residual | Target $y_i$ never influences its own residual |
| Training complexity | O(n) model evaluations per iteration | O(n) with incremental model updates |
| Prediction shift | Systematic bias accumulates | Eliminated by construction |
| Generalization | Overfits without careful regularization | Better generalization by default |
| Need for validation set | Essential to detect overfitting | Less critical due to unbiased training |
A naive implementation of ordered boosting would be computationally prohibitive—training $n$ different models to compute $n$ residuals at each iteration would multiply complexity by a factor of $n$. CatBoost employs several sophisticated optimizations to make ordered boosting practical.
Incremental Model Updates
The key insight is that the models $F^{<i}$ and $F^{<i+1}$ differ by only one sample. Instead of retraining from scratch, CatBoost maintains a single model and updates it incrementally while walking through the permutation.

For decision tree base learners, this means:

- the tree structure is fixed, so adding a sample never changes which leaf any point falls into;
- each leaf keeps running statistics (a sum and a count), and adding one sample only touches the single leaf it lands in per tree;
- a prediction at any point in the walk is read directly from the current leaf statistics, as in the `IncrementalTreeEnsemble` sketch below.
Oblivious Decision Trees
CatBoost uses oblivious (symmetric) decision trees as base learners, and they are crucial for efficient ordered boosting: every node at a given depth uses the same split, so a depth-$d$ tree is fully described by $d$ (feature, threshold) pairs and a table of $2^d$ leaf values.
We'll explore symmetric trees in detail in a later section, but their key property here is: once the tree structure is determined, computing leaf values for any subset of the data is O(n).
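A brief sketch of that property (a simplified illustration, not CatBoost's data structures): in an oblivious tree every level shares one (feature, threshold) split, so a sample's leaf index is just the bit pattern of its split outcomes, and leaf values for any subset of rows can be recomputed in a single pass.

```python
import numpy as np

class ObliviousTree:
    """Depth-d symmetric tree: one (feature, threshold) pair per level."""

    def __init__(self, features, thresholds):
        self.features = features        # length d
        self.thresholds = thresholds    # length d
        self.n_leaves = 2 ** len(features)

    def leaf_index(self, x):
        """Leaf index is a d-bit number built from the split outcomes."""
        idx = 0
        for level, (f, thr) in enumerate(zip(self.features, self.thresholds)):
            idx |= int(x[f] > thr) << level
        return idx

    def leaf_values_for(self, X, targets):
        """One O(n) pass: mean target per leaf over any subset of rows."""
        sums = np.zeros(self.n_leaves)
        counts = np.zeros(self.n_leaves)
        for x, t in zip(X, targets):
            i = self.leaf_index(x)
            sums[i] += t
            counts[i] += 1
        return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

# Usage: a depth-3 tree splitting on features 0, 2, and 4
tree = ObliviousTree(features=[0, 2, 4], thresholds=[0.0, 0.5, -1.0])
X = np.random.randn(1000, 5)
r = np.random.randn(1000)
print(tree.leaf_values_for(X, r).shape)   # (8,) -- one value per leaf
```

The full ordered-boosting pseudocode below builds on exactly this leaf-statistics structure.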
```python
# Pseudocode sketch: `Tree`, `SymmetricTree`, and `fit_oblivious_tree`
# are assumed to be defined elsewhere.
from __future__ import annotations

from typing import List

import numpy as np


def ordered_boosting_iteration(
    X: np.ndarray,
    y: np.ndarray,
    current_ensemble: List[Tree],
    permutation: np.ndarray,
    learning_rate: float
) -> Tree:
    """
    Pseudocode for one iteration of ordered boosting.

    Key insight: We compute residuals incrementally, where each sample's
    residual uses only predictions from samples appearing earlier in the
    permutation order.
    """
    n_samples = len(y)
    residuals = np.zeros(n_samples)

    # Incremental model maintains predictions as samples are added
    incremental_model = IncrementalTreeEnsemble(current_ensemble)

    for i in range(n_samples):
        sample_idx = permutation[i]

        # Prediction uses only samples seen so far (indices 0..i-1)
        # This prediction has never seen y[sample_idx]
        pred = incremental_model.predict_single(X[sample_idx])

        # Residual is unbiased - no target leakage
        residuals[sample_idx] = y[sample_idx] - pred

        # Update incremental model to include this sample
        # This is O(n_trees * depth) thanks to the fixed symmetric tree structure
        incremental_model.add_sample(X[sample_idx], y[sample_idx])

    # Fit new tree to unbiased residuals
    new_tree = fit_oblivious_tree(X, residuals)

    # Scale by learning rate
    new_tree.leaf_values *= learning_rate

    return new_tree


class IncrementalTreeEnsemble:
    """
    Efficiently maintains tree predictions as samples are added.

    For symmetric trees, structure is fixed - only leaf statistics change.
    Adding a sample updates running statistics in O(tree_depth) per tree.
    """

    def __init__(self, trees: List[SymmetricTree]):
        self.trees = trees
        # For each tree and each leaf, maintain sum and count
        self.leaf_sums = [[0.0] * tree.n_leaves for tree in trees]
        self.leaf_counts = [[0] * tree.n_leaves for tree in trees]

    def add_sample(self, x: np.ndarray, y: float):
        """Add a sample to the incremental model. O(n_trees * depth)."""
        for t, tree in enumerate(self.trees):
            leaf_idx = tree.get_leaf_index(x)
            self.leaf_sums[t][leaf_idx] += y
            self.leaf_counts[t][leaf_idx] += 1

    def predict_single(self, x: np.ndarray) -> float:
        """Predict using current incremental model state."""
        prediction = 0.0
        for t, tree in enumerate(self.trees):
            leaf_idx = tree.get_leaf_index(x)
            if self.leaf_counts[t][leaf_idx] > 0:
                # Leaf value is mean of samples in that leaf so far
                prediction += (self.leaf_sums[t][leaf_idx] /
                               self.leaf_counts[t][leaf_idx])
        return prediction
```

Thanks to symmetric trees and incremental updates, ordered boosting adds only a constant factor of overhead compared to traditional gradient boosting. The complexity per iteration remains O(n × d × depth), where n is the number of samples, d the number of features, and depth the tree depth, the same as standard implementations.
Understanding when ordered boosting provides the most significant advantage helps practitioners make informed decisions about algorithm selection.
When Ordered Boosting Helps Most
Ordered boosting's benefits are most pronounced in scenarios where prediction shift is most severe:
Small datasets (n < 10,000): With fewer samples, each point's influence on the model is larger, making target leakage more impactful.
High model complexity: Deeper trees and more iterations create more opportunities for the model to memorize training noise.
High noise-to-signal ratio: When true signal is weak relative to noise, the bias from prediction shift becomes a larger fraction of total error.
Imbalanced classes: Minority class samples have outsized influence per-sample, amplifying leakage effects.
When the Advantage is Less Pronounced
Conversely, ordered boosting adds complexity with diminishing returns when:
Large datasets (n > 100,000): Each sample has minimal influence on predictions; dilution naturally reduces prediction shift.
Strong regularization: Aggressive early stopping and shallow trees already limit overfitting through other means.
High signal-to-noise: When signal dominates, the bias from prediction shift is small relative to the signal being learned.
```python
from catboost import CatBoostClassifier
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# =================================================================
# CatBoost with Ordered Boosting (default for small/medium datasets)
# =================================================================
model_ordered = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    # Ordered boosting configuration
    boosting_type='Ordered',   # Explicitly use ordered boosting
    # Random seed for permutation reproducibility
    random_seed=42,
    # Verbose output during training
    verbose=100,
    # Early stopping on validation set
    early_stopping_rounds=50,
)

# Train with ordered boosting
model_ordered.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    use_best_model=True
)

# =================================================================
# CatBoost with Plain Boosting (traditional, faster for large data)
# =================================================================
model_plain = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    # Plain (traditional) boosting - faster but with prediction shift
    boosting_type='Plain',
    random_seed=42,
    verbose=100,
    early_stopping_rounds=50,
)

model_plain.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    use_best_model=True
)

# =================================================================
# Compare Results
# =================================================================
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluate_model(model, X_test, y_test, name):
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy_score(y_test, preds):.4f}")
    print(f"  ROC-AUC:  {roc_auc_score(y_test, probs):.4f}")
    print(f"  Best iteration: {model.best_iteration_}")
    print(f"  Trees used: {model.tree_count_}")


evaluate_model(model_ordered, X_test, y_test, "Ordered Boosting")
evaluate_model(model_plain, X_test, y_test, "Plain Boosting")

# =================================================================
# When to Use Each Mode
# =================================================================
"""
USE ORDERED BOOSTING (boosting_type='Ordered'):
- Small to medium datasets (< 100K samples)
- When overfitting is a concern
- When you have limited features to regularize with
- When achieving maximum accuracy matters more than speed
- As a default choice for production quality

USE PLAIN BOOSTING (boosting_type='Plain'):
- Large datasets (> 100K samples) where speed matters
- When other regularization is already aggressive
- For fast iteration during exploration/prototyping
- When training time is the bottleneck

When boosting_type is not set explicitly, CatBoost chooses automatically:
Ordered for smaller datasets and Plain for larger ones.
"""
```

CatBoost's ordered boosting comes with formal theoretical guarantees that distinguish it from heuristic regularization approaches. These guarantees provide principled confidence in the algorithm's behavior.
Unbiasedness of Residuals
The core theoretical property is that residuals computed via ordered boosting are unbiased estimators of the true negative gradient:

$$\mathbb{E}[r_i \mid x_i] = -\,\mathbb{E}\left[\frac{\partial L(y, F)}{\partial F}\bigg|_{F=F^{<i}(x_i)}\right]$$
This expectation is taken over:

- the random draw of the samples preceding $i$ in the permutation (the data that $F^{<i}$ was trained on), and
- the noise in the target at $x_i$, which $F^{<i}$ has never observed.
For traditional gradient boosting, this expectation is biased because $F$ saw $y_i$.
Consistency and Convergence
Under mild regularity conditions, ordered boosting is consistent—as $n \to \infty$ and the number of iterations $T \to \infty$ appropriately:
$$F_T(x) \xrightarrow{P} f^*(x) = \arg\min_f \mathbb{E}[L(Y, f(X))]$$
The convergence rate matches optimal rates for gradient descent in function space, without the systematic bias that slows convergence in traditional boosting.
The unbiasedness guarantee means you can train more iterations without the usual overfitting worry. While you still need early stopping for efficiency, the model won't suddenly degrade as severely from over-training. This gives you a larger 'safe zone' of iteration counts.
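One way to see this wider safe zone in practice is to train both modes without early stopping and inspect the per-iteration validation metric that CatBoost records. The sketch below assumes the `X_train`/`X_test` split from the earlier example and uses `get_evals_result()`; exact curves depend on the dataset.

```python
import numpy as np
from catboost import CatBoostClassifier

curves = {}
for name in ("Ordered", "Plain"):
    model = CatBoostClassifier(
        iterations=1000, learning_rate=0.05, depth=6,
        boosting_type=name, random_seed=42, verbose=0,
    )
    model.fit(X_train, y_train, eval_set=(X_test, y_test))
    evals = model.get_evals_result()
    # eval_set results live under the non-'learn' key; metric is Logloss by default
    val_key = [k for k in evals if k != "learn"][0]
    curves[name] = evals[val_key]["Logloss"]

for name, curve in curves.items():
    best = int(np.argmin(curve))
    print(f"{name:8s} best logloss {min(curve):.4f} at iter {best}, "
          f"logloss at final iter {curve[-1]:.4f}")
# A flatter tail for Ordered indicates a wider range of 'safe' iteration counts.
```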
Variance Analysis
While ordered boosting eliminates bias, it introduces variance from the random permutation. This bias-variance tradeoff is favorable in practice:

- the bias being removed is $O(1)$ and would not shrink as $n$ grows, whereas
- the variance being introduced is $O(1/n)$ and vanishes with more data.
The permutation variance is further reduced by:

- maintaining several permutations ($s > 1$) and averaging over them, and
- larger datasets, where any single ordering has less influence on the final model.
Generalization Bounds
For classification with the exponential loss, ordered boosting achieves generalization bounds that scale as:
$$R(F_T) \leq \hat{R}(F_T) + O\left(\sqrt{\frac{T \cdot VC(\mathcal{H})}{n}}\right)$$
where:

- $R(F_T)$ is the true (generalization) risk of the ensemble after $T$ iterations,
- $\hat{R}(F_T)$ is its empirical risk on the training set,
- $VC(\mathcal{H})$ is the VC dimension (capacity) of the base learner class $\mathcal{H}$, and
- $n$ is the number of training samples.
Crucially, this bound doesn't require additional correction terms for the prediction shift bias that would appear in analysis of traditional boosting.
| Property | Traditional GB | Ordered Boosting |
|---|---|---|
| Residual estimator | Biased (conditional on $y_i$) | Unbiased |
| Bias magnitude | $O(1)$ - doesn't vanish with $n$ | Zero by construction |
| Variance | Lower (deterministic residuals) | Higher but $O(1/n)$ |
| Overall MSE | Dominated by bias for large $T$ | Scales optimally with $n, T$ |
| Consistency | Requires careful regularization schedule | Holds under mild conditions |
| Safe iteration range | Narrow; overfitting risk | Wide; gradual degradation |
Ordered boosting represents one of the most significant theoretical advances in gradient boosting since its inception. By formally characterizing and eliminating prediction shift, CatBoost achieves better generalization through principled mechanism design rather than after-the-fact regularization.
Next, we'll explore CatBoost's sophisticated handling of categorical features—another area where it innovated beyond XGBoost and LightGBM. The categorical feature handling builds on the same ordered-processing ideas to compute target encodings without leakage.