One of XGBoost's most significant innovations is its use of second-order Taylor approximation for optimizing the loss function. While traditional gradient boosting methods rely only on first-order gradient information, XGBoost incorporates second-order derivatives (Hessians), providing curvature information that enables more precise and efficient optimization.
This approach draws from Newton's method in numerical optimization—a technique known for faster convergence than simple gradient descent. By understanding second-order approximation, you'll grasp why XGBoost trains faster and often achieves better solutions than its predecessors.
By the end of this page, you will understand: (1) The mathematical derivation of the second-order Taylor expansion, (2) Why Hessian information improves optimization, (3) How to compute gradients and Hessians for common loss functions, (4) The connection to Newton's method and its advantages, and (5) How XGBoost uses gradient statistics for tree construction.
Before diving into XGBoost's specific implementation, let's establish the fundamental difference between first-order and second-order optimization methods.
First-Order Methods (Gradient Descent)
First-order methods use only the gradient (first derivative) of the objective function:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
where $\theta_t$ is the current parameter vector, $\eta$ is the learning rate, and $\nabla L(\theta_t)$ is the gradient of the loss at $\theta_t$.
The gradient tells us the direction of steepest descent but provides no information about how far to go. The learning rate $\eta$ must be manually tuned—too large causes oscillation, too small causes slow convergence.
Second-Order Methods (Newton's Method)
Second-order methods incorporate the Hessian matrix (second derivatives):
$$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
where $H = \nabla^2 L(\theta_t)$ is the Hessian matrix.
The Hessian provides curvature information—how quickly the gradient changes. This allows adaptive step sizes: take larger steps when curvature is low (flat regions) and smaller steps when curvature is high (sharp turns).
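To see the difference concretely, consider minimizing a one-dimensional quadratic (a standard illustration, not part of XGBoost itself):
$$L(\theta) = \frac{a}{2}(\theta - \theta^*)^2, \qquad \nabla L(\theta) = a(\theta - \theta^*), \qquad \nabla^2 L(\theta) = a$$
Gradient descent moves to $\theta_{t+1} = \theta_t - \eta\, a(\theta_t - \theta^*)$, which lands on the minimum only if $\eta$ happens to equal $1/a$. Newton's method moves to $\theta_{t+1} = \theta_t - \frac{a(\theta_t - \theta^*)}{a} = \theta^*$, reaching the minimum in a single step regardless of $a$.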
XGBoost doesn't compute the full Hessian matrix (which would be prohibitively expensive). Instead, it uses a diagonal approximation where each sample contributes scalar gradient (g) and Hessian (h) values. This captures curvature benefits without matrix inversion costs.
Now let's derive XGBoost's second-order approximation mathematically.
**The Setup**
At iteration $t$, we want to add tree $f_t$ to minimize:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
where $\hat{y}_i^{(t-1)}$ is the current prediction and $f_t(x_i)$ is the new tree's contribution.
**Taylor Expansion**
We expand the loss function around $\hat{y}_i^{(t-1)}$ using a second-order Taylor series:
$$l(y_i, \hat{y}_i^{(t-1)} + \epsilon) \approx l(y_i, \hat{y}_i^{(t-1)}) + \frac{\partial l}{\partial \hat{y}} \cdot \epsilon + \frac{1}{2} \frac{\partial^2 l}{\partial \hat{y}^2} \cdot \epsilon^2$$
where $\epsilon = f_t(x_i)$ is the step we're taking.
Defining Gradient and Hessian
Let's define the gradient and Hessian for each sample:
$$g_i = \frac{\partial l(y_i, \hat{y})}{\partial \hat{y}} \bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}$$
$$h_i = \frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2} \bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss for sample $i$, both evaluated at the current prediction $\hat{y}_i^{(t-1)}$.
**The Approximated Objective**
Substituting and removing the constant term $l(y_i, \hat{y}_i^{(t-1)})$:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i \cdot f_t(x_i) + \frac{1}{2} h_i \cdot f_t(x_i)^2 \right] + \Omega(f_t)$$
This is a **quadratic function** in $f_t(x_i)$! For any fixed tree structure, we can solve for the optimal leaf values analytically.
Most common loss functions (MSE, logistic, etc.) have smooth second derivatives. The quadratic Taylor approximation is accurate when the step size f_t(x) is small relative to the loss curvature. XGBoost's learning rate parameter ensures this by shrinking tree predictions.
```python
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return 0.5 * (y_true - y_pred) ** 2

def mse_gradient(y_true, y_pred):
    """First derivative of MSE w.r.t. y_pred."""
    return y_pred - y_true

def mse_hessian(y_true, y_pred):
    """Second derivative of MSE w.r.t. y_pred."""
    return 1.0  # Constant for MSE

def taylor_approx(l0, g, h, delta):
    """Second-order Taylor approximation of loss."""
    return l0 + g * delta + 0.5 * h * delta ** 2

# Example: Compare true loss vs Taylor approximation
y_true = 1.0
y_pred_current = 0.3  # Current prediction

# Compute current loss, gradient, Hessian
l0 = mse_loss(y_true, y_pred_current)
g = mse_gradient(y_true, y_pred_current)
h = mse_hessian(y_true, y_pred_current)

print("Second-Order Taylor Approximation for MSE Loss")
print("=" * 55)
print(f"True label: y = {y_true}")
print(f"Current prediction: ŷ = {y_pred_current}")
print(f"Current loss: L = {l0:.4f}")
print(f"Gradient: g = ∂L/∂ŷ = {g:.4f}")
print(f"Hessian: h = ∂²L/∂ŷ² = {h:.4f}")

# Compare approximation quality for different step sizes
print(f"\nComparing True Loss vs Taylor Approximation:")
print("-" * 55)
print(f"{'Step (δ)':>10} {'True Loss':>12} {'Taylor Approx':>14} {'Error':>10}")
print("-" * 55)

deltas = [-0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.5, 0.7]
for delta in deltas:
    new_pred = y_pred_current + delta
    true_loss = mse_loss(y_true, new_pred)
    approx_loss = taylor_approx(l0, g, h, delta)
    error = abs(true_loss - approx_loss)
    print(f"{delta:>10.2f} {true_loss:>12.6f} {approx_loss:>14.6f} {error:>10.6f}")

# For MSE, Taylor approximation is EXACT (since MSE is quadratic)
print("\n" + "=" * 55)
print("Note: For MSE loss, the Taylor approximation is EXACT")
print("because MSE is already a quadratic function.")
print("For other losses (e.g., logistic), there's approximation error.")
```

Different machine learning tasks use different loss functions. Let's derive the gradients and Hessians for the most common ones used with XGBoost.
1. Squared Error (Regression)
$$l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$
Gradient: $$g = \frac{\partial l}{\partial \hat{y}} = -(y - \hat{y}) = \hat{y} - y$$
Hessian: $$h = \frac{\partial^2 l}{\partial \hat{y}^2} = 1$$
The Hessian is constant! This makes squared error loss particularly simple—curvature is uniform everywhere.
2. Logistic Loss (Binary Classification)
For binary classification, XGBoost predicts logits $\hat{y}$, and the probability is: $$p = \sigma(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}$$
The logistic loss (negative log-likelihood) is: $$l(y, \hat{y}) = -[y \log(p) + (1-y) \log(1-p)]$$
After simplification: $$l(y, \hat{y}) = -y \cdot \hat{y} + \log(1 + e^{\hat{y}})$$
Gradient: $$g = \frac{\partial l}{\partial \hat{y}} = -y + \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} = p - y = \sigma(\hat{y}) - y$$
Hessian: $$h = \frac{\partial^2 l}{\partial \hat{y}^2} = p(1-p) = \sigma(\hat{y})(1 - \sigma(\hat{y}))$$
The Hessian varies: it's maximized when $p = 0.5$ (uncertain predictions) and approaches zero as $p \to 0$ or $p \to 1$ (confident predictions).
For logistic loss, h = p(1-p) is highest at p=0.5 and lowest near p=0 or p=1. Intuitively: when the model is uncertain, the curvature is high (small changes in logit significantly affect the loss). When confident, curvature is low (the model is already near a plateau).
3. Softmax Loss (Multi-class Classification)
For K classes, the softmax probability for class k is: $$p_k = \frac{e^{\hat{y}_k}}{\sum_{j=1}^{K} e^{\hat{y}_j}}$$
For the true class $k = c$, the loss is: $$l = -\log(p_c)$$
Gradient for class k: $$g_k = p_k - \mathbb{1}[k = c]$$
Hessian for class k (diagonal approximation): $$h_k = p_k(1 - p_k)$$
This is identical in form to binary logistic, applied per-class.
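As a minimal sketch of how these per-class statistics can be computed (the function and variable names here are illustrative, not XGBoost's internal API):

```python
import numpy as np

def softmax_grad_hess(logits, true_class):
    """Per-class gradient and diagonal Hessian for softmax cross-entropy.

    logits: array of shape (K,) with raw scores for one sample.
    true_class: integer index c of the correct class.
    Returns g_k = p_k - 1[k == c] and h_k = p_k * (1 - p_k) for each class k.
    """
    z = logits - np.max(logits)          # stabilize the exponentials
    p = np.exp(z) / np.sum(np.exp(z))    # softmax probabilities
    g = p.copy()
    g[true_class] -= 1.0                 # subtract the one-hot indicator
    h = p * (1.0 - p)                    # diagonal Hessian approximation
    return g, h

# One sample with K = 3 classes, true class c = 2
g, h = softmax_grad_hess(np.array([0.5, -1.0, 2.0]), true_class=2)
print("g:", np.round(g, 4))  # only the true class has a negative gradient
print("h:", np.round(h, 4))  # curvature is largest for mid-range probabilities
```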
4. Absolute Error (Robust Regression)
$$l(y, \hat{y}) = |y - \hat{y}|$$
Gradient: $$g = -\text{sign}(y - \hat{y})$$
Hessian: $$h = 0 \text{ (undefined at } y = \hat{y}\text{)}$$
This is problematic for XGBoost's second-order method! In practice, XGBoost uses the Huber loss or a smoothed approximation instead.
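One commonly used smooth surrogate is the pseudo-Huber loss, which behaves like squared error near zero and like absolute error far from zero while keeping a strictly positive Hessian everywhere. The sketch below illustrates its gradient and Hessian; it is an illustration of the idea, not XGBoost's internal implementation:

```python
import numpy as np

def pseudo_huber_grad_hess(y_true, y_pred, delta=1.0):
    """Gradient and Hessian of the pseudo-Huber loss, a smooth Huber-like surrogate.

    L = delta^2 * (sqrt(1 + (r/delta)^2) - 1), with residual r = y_pred - y_true.
    Unlike absolute error, the Hessian is strictly positive everywhere.
    """
    r = y_pred - y_true
    scale = np.sqrt(1.0 + (r / delta) ** 2)
    g = r / scale          # ≈ r for small residuals, ≈ ±delta for large ones
    h = 1.0 / scale ** 3   # ≈ 1 near zero, decays smoothly for outliers
    return g, h

g, h = pseudo_huber_grad_hess(np.array([0.0, 0.0, 0.0]), np.array([0.1, 1.0, 10.0]))
print("g:", np.round(g, 4))  # gradient saturates instead of growing linearly
print("h:", np.round(h, 4))  # Hessian stays positive, so Newton steps are well-defined
```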
| Loss Function | Use Case | Gradient (g) | Hessian (h) |
|---|---|---|---|
| Squared Error | Regression | $\hat{y} - y$ | $1$ |
| Logistic | Binary Classification | $p - y$ | $p(1-p)$ |
| Softmax | Multi-class | $p_k - \mathbb{1}_{k=c}$ | $p_k(1-p_k)$ |
| Huber | Robust Regression | $\hat{y} - y$ or $\pm\delta$ | $1$ or $0$ |
| Poisson | Count Data | $e^{\hat{y}} - y$ | $e^{\hat{y}}$ |
```python
import numpy as np

class GradientHessian:
    """Compute gradients and Hessians for common loss functions."""

    @staticmethod
    def squared_error(y_true, y_pred):
        """
        Squared Error Loss: L = 0.5 * (y - ŷ)²
        g = ŷ - y
        h = 1
        """
        g = y_pred - y_true
        h = np.ones_like(y_pred)
        return g, h

    @staticmethod
    def logistic(y_true, y_pred_logit):
        """
        Logistic Loss: L = -[y·log(p) + (1-y)·log(1-p)] where p = sigmoid(ŷ)
        g = p - y
        h = p(1 - p)
        """
        p = 1 / (1 + np.exp(-y_pred_logit))  # sigmoid
        g = p - y_true
        h = p * (1 - p)
        # Clip h to avoid numerical issues
        h = np.clip(h, 1e-6, 1.0)
        return g, h

    @staticmethod
    def huber(y_true, y_pred, delta=1.0):
        """
        Huber Loss: Quadratic for small errors, linear for large.
        L = 0.5*(y - ŷ)²        if |y - ŷ| ≤ δ
            δ*|y - ŷ| - 0.5*δ²  otherwise
        g = ŷ - y if |residual| ≤ δ, else sign(residual)*δ
        h = 1 if |residual| ≤ δ, else 0 (or small ε)
        """
        residual = y_pred - y_true
        abs_residual = np.abs(residual)
        # Gradient
        g = np.where(abs_residual <= delta, residual, delta * np.sign(residual))
        # Hessian (use small epsilon for numerical stability)
        h = np.where(abs_residual <= delta, 1.0, 1e-6)
        return g, h

    @staticmethod
    def poisson(y_true, y_pred_log):
        """
        Poisson Loss for count data: L = exp(ŷ) - y·ŷ
        where ŷ is the log of the expected count.
        g = exp(ŷ) - y
        h = exp(ŷ)
        """
        exp_pred = np.exp(y_pred_log)
        g = exp_pred - y_true
        h = exp_pred
        return g, h

# Demonstration
print("Gradient and Hessian Computation Examples")
print("=" * 60)

# Example 1: Squared Error
print("\n1. Squared Error (Regression)")
print("-" * 40)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 1.8, 3.5, 3.7])
g, h = GradientHessian.squared_error(y, y_pred)
for i in range(len(y)):
    print(f"  Sample {i}: y={y[i]:.1f}, ŷ={y_pred[i]:.1f}, "
          f"g={g[i]:+.3f}, h={h[i]:.3f}")

# Example 2: Logistic Loss
print("\n2. Logistic Loss (Binary Classification)")
print("-" * 40)
y = np.array([0, 0, 1, 1])
logits = np.array([-2.0, -0.5, 0.5, 2.0])
probs = 1 / (1 + np.exp(-logits))
g, h = GradientHessian.logistic(y, logits)
for i in range(len(y)):
    print(f"  Sample {i}: y={y[i]}, logit={logits[i]:+.1f}, "
          f"p={probs[i]:.3f}, g={g[i]:+.3f}, h={h[i]:.3f}")

# Example 3: Connection to optimal leaf weight
print("\n3. Computing Optimal Leaf Weight")
print("-" * 40)
# Imagine all 4 samples land in the same leaf
G = np.sum(g)   # Sum of gradients
H = np.sum(h)   # Sum of Hessians
lambda_ = 1.0   # L2 regularization

w_optimal = -G / (H + lambda_)
print(f"  Sum of gradients (G): {G:.4f}")
print(f"  Sum of Hessians (H): {H:.4f}")
print(f"  Regularization (λ): {lambda_}")
print(f"  Optimal leaf weight: w* = -G/(H+λ) = {w_optimal:.4f}")
```

Now we connect the Taylor approximation to XGBoost's tree construction algorithm.
Reformulating with Tree Structure
Recall that a tree $f(x)$ maps inputs to leaf weights: $$f(x) = w_{q(x)}$$
where $q(x)$ assigns sample $x$ to a leaf index.
Let $I_j = \{i : q(x_i) = j\}$ be the set of samples in leaf $j$. The approximated objective becomes:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$
Defining the aggregate statistics $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ for each leaf, we get:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T$$
Optimal Leaf Weights
For a fixed tree structure (fixed $q$), this is a separable quadratic optimization in the leaf weights. Taking the derivative with respect to $w_j$ and setting to zero:
$$\frac{\partial \tilde{\mathcal{L}}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0$$
Solving: $$w_j^* = -\frac{G_j}{H_j + \lambda}$$
This is the closed-form optimal weight for leaf $j$—no iterative optimization needed!
Optimal Objective Value
Substituting $w_j^*$ back into the objective:
$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
This expression serves as a quality score for the tree structure $q$. Lower values are better (more negative first term means better fit; more leaves add penalty via $\gamma T$).
In the optimal weight formula w* = -G/(H+λ), the Hessian H appears in the denominator. When H is large (high curvature), the weight is smaller—the algorithm takes conservative steps. When H is small (flat region), larger weights are allowed. This is exactly the Newton's method behavior: scale the gradient by the inverse curvature.
```python
import numpy as np

def compute_tree_score(leaf_data, lambda_, gamma):
    """
    Compute the objective value for a tree structure.

    Parameters:
    -----------
    leaf_data : list of (G_j, H_j) tuples for each leaf
    lambda_   : L2 regularization parameter
    gamma     : Leaf count penalty

    Returns:
    --------
    score : The objective value (lower is better)
    optimal_weights : Optimal weight for each leaf
    """
    T = len(leaf_data)  # Number of leaves
    optimal_weights = []
    score_sum = 0

    for G_j, H_j in leaf_data:
        # Optimal weight for this leaf
        w_j = -G_j / (H_j + lambda_)
        optimal_weights.append(w_j)
        # Contribution to objective
        score_sum += (G_j ** 2) / (H_j + lambda_)

    # Full objective
    score = -0.5 * score_sum + gamma * T
    return score, optimal_weights

# Example: Compare two tree structures
print("Tree Structure Optimization via Second-Order Approximation")
print("=" * 65)

# Regularization parameters
lambda_ = 1.0
gamma = 0.5

# Tree A: Simple (2 leaves)
# Leaf 1: samples with G=5, H=10
# Leaf 2: samples with G=-3, H=8
tree_A_leaves = [(5.0, 10.0), (-3.0, 8.0)]
score_A, weights_A = compute_tree_score(tree_A_leaves, lambda_, gamma)

print(f"\nTree A (2 leaves):")
print(f"  Leaf 1: G={tree_A_leaves[0][0]}, H={tree_A_leaves[0][1]} → w*={weights_A[0]:.4f}")
print(f"  Leaf 2: G={tree_A_leaves[1][0]}, H={tree_A_leaves[1][1]} → w*={weights_A[1]:.4f}")
print(f"  Objective score: {score_A:.4f}")

# Tree B: Complex (4 leaves) - Tree A's leaves split further
# Leaf 1a: G=3, H=6
# Leaf 1b: G=2, H=4
# Leaf 2a: G=-2, H=5
# Leaf 2b: G=-1, H=3
tree_B_leaves = [(3.0, 6.0), (2.0, 4.0), (-2.0, 5.0), (-1.0, 3.0)]
score_B, weights_B = compute_tree_score(tree_B_leaves, lambda_, gamma)

print(f"\nTree B (4 leaves) - further splits of Tree A:")
for i, (G, H) in enumerate(tree_B_leaves):
    print(f"  Leaf {i+1}: G={G}, H={H} → w*={weights_B[i]:.4f}")
print(f"  Objective score: {score_B:.4f}")

# Comparison
print(f"\n{'=' * 65}")
print(f"Score comparison: Tree A = {score_A:.4f}, Tree B = {score_B:.4f}")
if score_A < score_B:
    print("Tree A is better (lower objective) - simpler is preferred!")
else:
    print("Tree B is better (lower objective) - additional splits worthwhile!")

# Demonstrate how gamma affects the decision
print(f"\n{'=' * 65}")
print("Effect of gamma on tree preference:")
for gamma_test in [0, 0.5, 1.0, 2.0, 5.0]:
    score_A_test, _ = compute_tree_score(tree_A_leaves, lambda_, gamma_test)
    score_B_test, _ = compute_tree_score(tree_B_leaves, lambda_, gamma_test)
    better = "A" if score_A_test < score_B_test else "B"
    print(f"  γ = {gamma_test:.1f}: Score A = {score_A_test:.4f}, "
          f"Score B = {score_B_test:.4f} → Tree {better} preferred")
```

The split gain formula is central to XGBoost's greedy tree construction. Let's derive it rigorously from the quadratic approximation.
The Split Decision
Consider a leaf with aggregate statistics $(G, H)$ containing samples from set $I$. We want to determine if splitting this leaf into left ($I_L$) and right ($I_R$) children improves the objective.
Before Split (Single Leaf)
Objective contribution from this leaf: $$\text{Score}_{\text{before}} = -\frac{1}{2} \frac{G^2}{H + \lambda}$$
The structural penalty for this single leaf ($\gamma \times 1$) is already included in the total objective.
After Split (Two Leaves)
After splitting, the left child contains the samples in $I_L$ with statistics $(G_L, H_L)$, and the right child contains the samples in $I_R$ with statistics $(G_R, H_R)$.
Note: $G_L + G_R = G$ and $H_L + H_R = H$ (samples partition exactly).
Objective contribution: $$\text{Score}_{\text{after}} = -\frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right)$$
Plus one additional leaf penalty: $+\gamma$ (we went from 1 to 2 leaves).
The Gain Formula
The gain from the split is the reduction in objective (before - after, accounting for signs):
$$\text{Gain} = \text{Score}_{\text{before}} - \text{Score}_{\text{after}} - \gamma$$
$$\text{Gain} = -\frac{1}{2} \frac{G^2}{H + \lambda} - \left( -\frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right) \right) - \gamma$$
Simplifying:
$$\boxed{\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma}$$
Interpretation
Each fraction $\frac{G^2}{H + \lambda}$ can be thought of as the "information content" or "score" of a leaf. The split gain is half the amount by which the children's combined score exceeds the parent's score, minus the penalty $\gamma$ for adding one more leaf.
The split is made if and only if Gain > 0.
Traditional gradient boosting (first-order) fits trees to pseudo-residuals without this explicit gain formula. XGBoost's second-order formulation provides a rigorous criterion that accounts for sample curvature—splits in regions with high Hessian (more confidence) are valued differently than splits in low-Hessian regions.
```python
import numpy as np

def compute_split_gain(G_L, H_L, G_R, H_R, lambda_, gamma):
    """
    Compute the gain from splitting a leaf.

    Gain = 0.5 * [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ

    Returns: gain value and component breakdown
    """
    G = G_L + G_R
    H = H_L + H_R

    score_L = (G_L ** 2) / (H_L + lambda_)
    score_R = (G_R ** 2) / (H_R + lambda_)
    score_parent = (G ** 2) / (H + lambda_)

    gain = 0.5 * (score_L + score_R - score_parent) - gamma

    return gain, {
        'score_L': score_L,
        'score_R': score_R,
        'score_parent': score_parent,
        'raw_improvement': 0.5 * (score_L + score_R - score_parent),
        'penalty': gamma
    }

def find_best_split(g, h, feature_values, lambda_, gamma):
    """
    Find the best split point for a single feature.
    Uses exact greedy algorithm: try every possible split point.
    """
    # Sort by feature value
    sorted_idx = np.argsort(feature_values)
    g_sorted = g[sorted_idx]
    h_sorted = h[sorted_idx]

    n = len(g)
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_split_idx = -1

    # Cumulative sums for efficient computation
    G_left = 0
    H_left = 0

    gains = []

    # Try each split point
    for i in range(n - 1):
        G_left += g_sorted[i]
        H_left += h_sorted[i]
        G_right = G_total - G_left
        H_right = H_total - H_left

        gain, _ = compute_split_gain(G_left, H_left, G_right, H_right, lambda_, gamma)
        gains.append((feature_values[sorted_idx[i]], gain))

        if gain > best_gain:
            best_gain = gain
            best_split_idx = i
            best_split_value = (feature_values[sorted_idx[i]] + feature_values[sorted_idx[i+1]]) / 2

    return best_gain, best_split_value, gains

# Example: Find best split
print("Finding the Best Split Using Second-Order Gain")
print("=" * 60)

np.random.seed(42)

# Simulate regression data
n_samples = 10
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y_true = np.array([1.2, 1.5, 1.8, 2.5, 3.0, 6.0, 6.5, 7.0, 7.2, 7.5])
y_pred = np.full(n_samples, np.mean(y_true))  # Current: predict mean

# Compute gradients and Hessians (MSE loss)
g = y_pred - y_true     # gradient
h = np.ones(n_samples)  # Hessian for MSE

print("Data:")
print(f"{'Feature':>10} {'y_true':>10} {'y_pred':>10} {'g':>10} {'h':>10}")
print("-" * 50)
for i in range(n_samples):
    print(f"{feature[i]:>10.1f} {y_true[i]:>10.2f} {y_pred[i]:>10.2f} "
          f"{g[i]:>10.4f} {h[i]:>10.1f}")

lambda_ = 1.0
gamma = 0.5

best_gain, best_split, all_gains = find_best_split(g, h, feature, lambda_, gamma)

print(f"\nRegularization: λ = {lambda_}, γ = {gamma}")
print(f"\nGains for each possible split point:")
print("-" * 40)
for split_val, gain in all_gains:
    marker = " ← BEST" if abs(gain - best_gain) < 1e-6 else ""
    print(f"  Split at feature ≤ {split_val:.1f}: Gain = {gain:+.4f}{marker}")

print(f"\nBest split: feature ≤ {best_split:.1f} with gain = {best_gain:.4f}")

if best_gain > 0:
    print("✓ Split is beneficial (gain > 0)")
else:
    print("✗ Split is not beneficial (gain ≤ 0)")
```

XGBoost's second-order approximation has a deep connection to Newton's method in numerical optimization. Understanding this connection illuminates why XGBoost converges faster and achieves better solutions.
Newton's Method Recap
For minimizing a function $f(\theta)$, Newton's method uses the update:
$$\theta_{t+1} = \theta_t - \frac{f'(\theta_t)}{f''(\theta_t)}$$
This is equivalent to fitting a local quadratic approximation to $f$ at $\theta_t$ and jumping directly to its minimum.
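Written out, the equivalence is a one-line minimization of the local quadratic model:
$$\min_{\delta}\; f(\theta_t) + f'(\theta_t)\,\delta + \frac{1}{2} f''(\theta_t)\,\delta^2 \quad\Longrightarrow\quad \delta^* = -\frac{f'(\theta_t)}{f''(\theta_t)}$$
which is exactly the Newton update, and (with an added L2 term) the same minimization that produces XGBoost's leaf weights.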
XGBoost as Functional Newton's Method
In XGBoost, we're optimizing in function space rather than parameter space. The "parameter" is the prediction $\hat{y}$, and we're finding the best increment $f_t(x)$ to add.
For each sample, the optimal increment is:
$$f_t(x_i)^* = -\frac{g_i}{h_i}$$
But we can't set arbitrary per-sample values—we're constrained by tree structure. The tree aggregates samples in leaves, giving:
$$w_j^* = -\frac{G_j}{H_j + \lambda} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
This is exactly a regularized Newton step at the leaf level!
| Newton's Method | XGBoost |
|---|---|
| Parameter θ | Prediction ŷ (in function space) |
| First derivative f'(θ) | Gradient g = ∂l/∂ŷ |
| Second derivative f''(θ) | Hessian h = ∂²l/∂ŷ² |
| Update: -f'/f'' | Leaf weight: -G/(H+λ) |
| Convergence: quadratic | Fast convergence in practice |
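A minimal numerical sketch of this correspondence (illustrative values and variable names, not library code): each sample's unconstrained Newton step would be $-g_i/h_i$, but the tree must assign one shared, regularized step per leaf.

```python
import numpy as np

# Four samples that end up in the same leaf (logistic-loss-style statistics)
g = np.array([0.3, -0.4, 0.25, -0.1])   # per-sample gradients
h = np.array([0.21, 0.24, 0.19, 0.09])  # per-sample Hessians
lambda_ = 1.0

# Unconstrained per-sample Newton steps: each sample would move by -g_i / h_i
per_sample_steps = -g / h
print("Per-sample Newton steps:", np.round(per_sample_steps, 3))

# The leaf can only apply ONE shared weight, so the regularized aggregate
# step -G/(H+λ) is used instead: a Newton step at the leaf level.
G, H = g.sum(), h.sum()
w_leaf = -G / (H + lambda_)
print(f"Shared leaf weight w* = -G/(H+λ) = {w_leaf:.4f}")
```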
Advantages of the Newton-like Approach
Adaptive Step Sizes: Unlike gradient descent with a fixed learning rate, the Hessian provides curvature-aware step sizes. When the loss surface is flat (small h), larger steps are taken; when curved (large h), conservative steps are taken.
Faster Convergence: Newton's method has quadratic convergence near the optimum, while gradient descent has linear convergence. In practice, this means fewer boosting iterations are needed.
Scale Invariance: The ratio $g/h$ is scale-invariant in certain respects—if you scale your target variable, the optimal weights adjust appropriately.
Loss-Agnostic Framework: By computing $g$ and $h$ for any twice-differentiable loss, the same tree-building algorithm works for regression, classification, and custom objectives.
Traditional gradient boosting is like gradient descent: it uses only the gradient direction, requiring careful learning rate tuning. XGBoost is like Newton's method: it uses curvature to take appropriately-sized steps. This is why XGBoost often needs fewer trees and less tuning to achieve good results.
```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent(gradient_fn, hessian_fn, x0, lr, n_iters):
    """Gradient Descent: x_{t+1} = x_t - lr * gradient(x_t)"""
    x = x0
    trajectory = [x]
    for _ in range(n_iters):
        x = x - lr * gradient_fn(x)
        trajectory.append(x)
    return np.array(trajectory)

def newton_method(gradient_fn, hessian_fn, x0, n_iters):
    """Newton's Method: x_{t+1} = x_t - gradient(x_t) / hessian(x_t)"""
    x = x0
    trajectory = [x]
    for _ in range(n_iters):
        g = gradient_fn(x)
        h = hessian_fn(x)
        x = x - g / h
        trajectory.append(x)
    return np.array(trajectory)

# Example: Minimize f(x) = x^2 (simple quadratic)
# Gradient: f'(x) = 2x
# Hessian:  f''(x) = 2

gradient_fn = lambda x: 2 * x
hessian_fn = lambda x: 2

x0 = 5.0  # Start far from optimum (x* = 0)
n_iters = 10

# Run both methods
gd_trajectory = gradient_descent(gradient_fn, hessian_fn, x0, lr=0.3, n_iters=n_iters)
newton_trajectory = newton_method(gradient_fn, hessian_fn, x0, n_iters=n_iters)

print("Comparison: Gradient Descent vs Newton's Method")
print("=" * 60)
print(f"Objective: f(x) = x², minimum at x* = 0")
print(f"Starting point: x0 = {x0}")
print()
print(f"{'Iteration':>10} {'GD (lr=0.3)':>15} {'Newton':>15}")
print("-" * 45)
for i in range(min(len(gd_trajectory), len(newton_trajectory))):
    print(f"{i:>10} {gd_trajectory[i]:>15.6f} {newton_trajectory[i]:>15.6f}")

print("\n" + "=" * 60)
print("Newton's method converges in 1 iteration for quadratics!")
print("This is analogous to XGBoost's per-leaf optimal weight.")

# For non-quadratic: f(x) = log(1 + exp(x)) (like logistic loss)
print("\n" + "=" * 60)
print("Non-quadratic example: f(x) = log(1 + exp(x))")
print("=" * 60)

softplus = lambda x: np.log(1 + np.exp(np.clip(x, -500, 500)))
softplus_grad = lambda x: 1 / (1 + np.exp(-x))              # sigmoid
softplus_hess = lambda x: (np.exp(x)) / (1 + np.exp(x))**2  # sigmoid * (1-sigmoid)

x0 = 3.0
n_iters = 15

gd_trajectory2 = gradient_descent(softplus_grad, softplus_hess, x0, lr=0.5, n_iters=n_iters)
newton_trajectory2 = newton_method(softplus_grad, softplus_hess, x0, n_iters=n_iters)

print(f"Starting point: x0 = {x0}")
print(f"Optimum at x* → -∞ (but plateaus)")
print()
print(f"{'Iter':>5} {'f(x) GD':>15} {'f(x) Newton':>15}")
print("-" * 40)
for i in range(min(8, len(gd_trajectory2), len(newton_trajectory2))):
    print(f"{i:>5} {softplus(gd_trajectory2[i]):>15.6f} "
          f"{softplus(newton_trajectory2[i]):>15.6f}")

print("\nNewton converges faster even for non-quadratic losses!")
```

Understanding the second-order approximation has practical implications for how you use XGBoost effectively.
1. Custom Loss Functions
To use a custom loss function in XGBoost, you must provide both gradient and Hessian functions:
```python
import xgboost as xgb
import numpy as np

def custom_huber_objective(y_pred, dtrain, delta=1.0):
    """
    Custom Huber loss objective for XGBoost.

    Huber loss is quadratic for small residuals, linear for large.
    This makes it robust to outliers.

    Must return (gradient, hessian) arrays.
    """
    y_true = dtrain.get_label()
    residual = y_pred - y_true
    abs_residual = np.abs(residual)

    # Gradient
    # g = residual               if |residual| <= delta
    # g = delta * sign(residual) if |residual| > delta
    grad = np.where(
        abs_residual <= delta,
        residual,
        delta * np.sign(residual)
    )

    # Hessian
    # h = 1                    if |residual| <= delta
    # h = 0 (or small epsilon) if |residual| > delta
    hess = np.where(
        abs_residual <= delta,
        np.ones_like(residual),
        np.full_like(residual, 1e-6)  # Small epsilon for numerical stability
    )

    return grad, hess

def custom_huber_eval(y_pred, dtrain, delta=1.0):
    """Evaluation metric for Huber loss."""
    y_true = dtrain.get_label()
    residual = np.abs(y_pred - y_true)
    loss = np.where(
        residual <= delta,
        0.5 * residual ** 2,
        delta * residual - 0.5 * delta ** 2
    )
    return 'huber_loss', float(np.mean(loss))

# Usage example
# X_train, y_train = ...  # Your data
# dtrain = xgb.DMatrix(X_train, label=y_train)

# model = xgb.train(
#     params={'tree_method': 'hist'},
#     dtrain=dtrain,
#     num_boost_round=100,
#     obj=lambda preds, dtrain: custom_huber_objective(preds, dtrain, delta=1.0),
#     feval=lambda preds, dtrain: custom_huber_eval(preds, dtrain, delta=1.0)
# )

print("Custom Loss Function Example: Huber Loss")
print("=" * 50)
print("To use custom losses in XGBoost, you must provide:")
print("  1. Gradient: ∂L/∂ŷ for each sample")
print("  2. Hessian: ∂²L/∂ŷ² for each sample")
print()
print("Common custom losses:")
print("  - Huber (robust regression)")
print("  - Focal (class imbalance)")
print("  - Quantile (prediction intervals)")
print("  - Custom business metric")
```

**2. Understanding Sample Weights Through Hessians**
For logistic loss, the Hessian $h_i = p_i(1 - p_i)$ acts as an implicit sample weight:
- Samples near the decision boundary ($p \approx 0.5$): high $h$, high influence
- Confident predictions ($p \approx 0$ or $p \approx 1$): low $h$, low influence
This means XGBoost automatically focuses on uncertain samples—similar to boosting's focus on hard examples, but arising naturally from the loss curvature.
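A small numerical illustration of this implicit weighting (assumed values, not library code): the optimal leaf weight can be read as a Hessian-weighted blend of the per-sample Newton steps $-g_i/h_i$, so high-$h$ (uncertain) samples dominate the result.

```python
import numpy as np

# Binary classification: same true label, different confidence levels
p = np.array([0.5, 0.9, 0.99])   # predicted probabilities for three samples with y = 1
y = np.ones(3)
g = p - y                         # logistic gradient
h = p * (1 - p)                   # logistic Hessian = implicit sample weight
lambda_ = 1.0

print("Per-sample h (implicit weight):", np.round(h, 4))

# The leaf weight is a Hessian-weighted blend of per-sample Newton steps -g_i/h_i:
# w* = sum_i h_i * (-g_i/h_i) / (sum_i h_i + lambda) = -G / (H + lambda)
per_sample_steps = -g / h
w_leaf = np.sum(h * per_sample_steps) / (np.sum(h) + lambda_)
print("Per-sample preferred steps:", np.round(per_sample_steps, 3))
print(f"Leaf weight (h-weighted blend): {w_leaf:.4f}")
print(f"Same as -G/(H+λ):               {-g.sum() / (h.sum() + lambda_):.4f}")
```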
**3. Minimum Hessian Sum (min_child_weight)**
XGBoost has a parameter `min_child_weight` (default=1) which requires:
$$H_{\text{child}} \geq \text{min_child_weight}$$
For a split to be valid, both children must have a sufficient Hessian sum. This prevents splits whose children carry too little total weight, which would produce unstable leaf weights and overfitting.
For MSE loss where $h=1$, min_child_weight=1 means each leaf needs at least 1 sample.
For logistic loss, the interpretation is more nuanced—it's about summed confidence, not sample count.
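As a minimal sketch of how this constraint plays out (illustrative code, not XGBoost's implementation): under logistic loss, a child full of confidently predicted samples can fail the check even though it contains several samples.

```python
import numpy as np

def split_allowed(h_left, h_right, min_child_weight=1.0):
    """Reject a candidate split if either child's Hessian sum is too small."""
    return h_left.sum() >= min_child_weight and h_right.sum() >= min_child_weight

# Logistic loss: h_i = p_i * (1 - p_i), so confident samples contribute little
p_left = np.array([0.97, 0.99, 0.98])    # very confident predictions
p_right = np.array([0.45, 0.55, 0.60])   # uncertain predictions
h_left, h_right = p_left * (1 - p_left), p_right * (1 - p_right)

print(f"H_left  = {h_left.sum():.3f}")   # small despite containing 3 samples
print(f"H_right = {h_right.sum():.3f}")
print("Split allowed:", split_allowed(h_left, h_right, min_child_weight=1.0))
```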
For the optimization to work correctly, Hessians must be positive (strictly, for convex losses). If you're implementing a custom loss and get NaN or unexpected behavior, check that your Hessian is always positive. A common fix is to use max(h, epsilon) for some small epsilon like 1e-6.
We have thoroughly explored XGBoost's second-order Taylor approximation—the mathematical technique that enables its efficient and accurate optimization.
**What's Next**
With the optimization framework understood, we'll explore XGBoost's **efficient split finding algorithms**. While the gain formula tells us how good a split is, actually finding the best split among all features and thresholds is computationally challenging. XGBoost introduces several innovations—including approximate algorithms and histogram-based methods—that make this tractable for large datasets.
You now understand XGBoost's second-order approximation—the Newton-like optimization that distinguishes it from traditional gradient boosting. This foundation in gradients and Hessians will be essential as we explore split finding, and you'll use these concepts whenever implementing custom loss functions.