One of XGBoost's most significant innovations is its use of second-order Taylor approximation for optimizing the loss function. While traditional gradient boosting methods rely only on first-order gradient information, XGBoost incorporates second-order derivatives (Hessians), providing curvature information that enables more precise and efficient optimization.
This approach draws from Newton's method in numerical optimization—a technique known for faster convergence than simple gradient descent. By understanding second-order approximation, you'll grasp why XGBoost trains faster and often achieves better solutions than its predecessors.
By the end of this page, you will understand: (1) The mathematical derivation of the second-order Taylor expansion, (2) Why Hessian information improves optimization, (3) How to compute gradients and Hessians for common loss functions, (4) The connection to Newton's method and its advantages, and (5) How XGBoost uses gradient statistics for tree construction.
Before diving into XGBoost's specific implementation, let's establish the fundamental difference between first-order and second-order optimization methods.
First-Order Methods (Gradient Descent)
First-order methods use only the gradient (first derivative) of the objective function:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
where $\theta_t$ is the current parameter vector, $\eta$ is the learning rate, and $\nabla L(\theta_t)$ is the gradient of the loss at $\theta_t$.
The gradient tells us the direction of steepest descent but provides no information about how far to go. The learning rate $\eta$ must be manually tuned—too large causes oscillation, too small causes slow convergence.
Second-Order Methods (Newton's Method)
Second-order methods incorporate the Hessian matrix (second derivatives):
$$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
where $H = \nabla^2 L(\theta_t)$ is the Hessian matrix.
The Hessian provides curvature information—how quickly the gradient changes. This allows adaptive step sizes: take larger steps when curvature is low (flat regions) and smaller steps when curvature is high (sharp turns).
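To see the difference concretely, consider minimizing a one-dimensional quadratic (a standard illustration, not part of XGBoost itself):
$$L(\theta) = \frac{a}{2}(\theta - \theta^*)^2, \qquad \nabla L(\theta) = a(\theta - \theta^*), \qquad \nabla^2 L(\theta) = a$$
Gradient descent moves to $\theta_{t+1} = \theta_t - \eta\, a(\theta_t - \theta^*)$, which lands on the minimum only if $\eta$ happens to equal $1/a$. Newton's method moves to $\theta_{t+1} = \theta_t - \frac{a(\theta_t - \theta^*)}{a} = \theta^*$, reaching the minimum in a single step regardless of $a$.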
XGBoost doesn't compute the full Hessian matrix (which would be prohibitively expensive). Instead, it uses a diagonal approximation where each sample contributes scalar gradient (g) and Hessian (h) values. This captures curvature benefits without matrix inversion costs.
Now let's derive XGBoost's second-order approximation mathematically.
**The Setup**
At iteration $t$, we want to add tree $f_t$ to minimize:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
where $\hat{y}_i^{(t-1)}$ is the current prediction and $f_t(x_i)$ is the new tree's contribution.
**Taylor Expansion**
We expand the loss function around $\hat{y}_i^{(t-1)}$ using a second-order Taylor series:
$$l(y_i, \hat{y}_i^{(t-1)} + \epsilon) \approx l(y_i, \hat{y}_i^{(t-1)}) + \frac{\partial l}{\partial \hat{y}} \cdot \epsilon + \frac{1}{2} \frac{\partial^2 l}{\partial \hat{y}^2} \cdot \epsilon^2$$
where $\epsilon = f_t(x_i)$ is the step we're taking.
Defining Gradient and Hessian
Let's define the gradient and Hessian for each sample:
$$g_i = \frac{\partial l(y_i, \hat{y})}{\partial \hat{y}} \bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}$$
$$h_i = \frac{\partial^2 l(y_i, \hat{y})}{\partial \hat{y}^2} \bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss for sample $i$, both evaluated at the current prediction $\hat{y}_i^{(t-1)}$.
**The Approximated Objective**
Substituting and removing the constant term $l(y_i, \hat{y}_i^{(t-1)})$:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i \cdot f_t(x_i) + \frac{1}{2} h_i \cdot f_t(x_i)^2 \right] + \Omega(f_t)$$
This is a **quadratic function** in $f_t(x_i)$! For any fixed tree structure, we can solve for the optimal leaf values analytically.
Most common loss functions (MSE, logistic, etc.) have smooth second derivatives. The quadratic Taylor approximation is accurate when the step size f_t(x) is small relative to the loss curvature. XGBoost's learning rate parameter ensures this by shrinking tree predictions.
```python
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return 0.5 * (y_true - y_pred) ** 2

def mse_gradient(y_true, y_pred):
    """First derivative of MSE w.r.t. y_pred."""
    return y_pred - y_true

def mse_hessian(y_true, y_pred):
    """Second derivative of MSE w.r.t. y_pred."""
    return 1.0  # Constant for MSE

def taylor_approx(l0, g, h, delta):
    """Second-order Taylor approximation of loss."""
    return l0 + g * delta + 0.5 * h * delta ** 2

# Example: Compare true loss vs Taylor approximation
y_true = 1.0
y_pred_current = 0.3  # Current prediction

# Compute current loss, gradient, Hessian
l0 = mse_loss(y_true, y_pred_current)
g = mse_gradient(y_true, y_pred_current)
h = mse_hessian(y_true, y_pred_current)

print("Second-Order Taylor Approximation for MSE Loss")
print("=" * 55)
print(f"True label: y = {y_true}")
print(f"Current prediction: ŷ = {y_pred_current}")
print(f"Current loss: L = {l0:.4f}")
print(f"Gradient: g = ∂L/∂ŷ = {g:.4f}")
print(f"Hessian: h = ∂²L/∂ŷ² = {h:.4f}")

# Compare approximation quality for different step sizes
print(f"\nComparing True Loss vs Taylor Approximation:")
print("-" * 55)
print(f"{'Step (δ)':>10} {'True Loss':>12} {'Taylor Approx':>14} {'Error':>10}")
print("-" * 55)

deltas = [-0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.5, 0.7]
for delta in deltas:
    new_pred = y_pred_current + delta
    true_loss = mse_loss(y_true, new_pred)
    approx_loss = taylor_approx(l0, g, h, delta)
    error = abs(true_loss - approx_loss)
    print(f"{delta:>10.2f} {true_loss:>12.6f} {approx_loss:>14.6f} {error:>10.6f}")

# For MSE, Taylor approximation is EXACT (since MSE is quadratic)
print("\n" + "=" * 55)
print("Note: For MSE loss, the Taylor approximation is EXACT")
print("because MSE is already a quadratic function.")
print("For other losses (e.g., logistic), there's approximation error.")
```

Different machine learning tasks use different loss functions. Let's derive the gradients and Hessians for the most common ones used with XGBoost.
1. Squared Error (Regression)
$$l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$
Gradient: $$g = \frac{\partial l}{\partial \hat{y}} = -(y - \hat{y}) = \hat{y} - y$$
Hessian: $$h = \frac{\partial^2 l}{\partial \hat{y}^2} = 1$$
The Hessian is constant! This makes squared error loss particularly simple—curvature is uniform everywhere.
2. Logistic Loss (Binary Classification)
For binary classification, XGBoost predicts logits $\hat{y}$, and the probability is: $$p = \sigma(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}$$
The logistic loss (negative log-likelihood) is: $$l(y, \hat{y}) = -[y \log(p) + (1-y) \log(1-p)]$$
After simplification: $$l(y, \hat{y}) = -y \cdot \hat{y} + \log(1 + e^{\hat{y}})$$
Gradient: $$g = \frac{\partial l}{\partial \hat{y}} = -y + \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} = p - y = \sigma(\hat{y}) - y$$
Hessian: $$h = \frac{\partial^2 l}{\partial \hat{y}^2} = p(1-p) = \sigma(\hat{y})(1 - \sigma(\hat{y}))$$
The Hessian varies: it's maximized when $p = 0.5$ (uncertain predictions) and approaches zero as $p \to 0$ or $p \to 1$ (confident predictions).
For logistic loss, h = p(1-p) is highest at p=0.5 and lowest near p=0 or p=1. Intuitively: when the model is uncertain, the curvature is high (small changes in logit significantly affect the loss). When confident, curvature is low (the model is already near a plateau).
3. Softmax Loss (Multi-class Classification)
For K classes, the softmax probability for class k is: $$p_k = \frac{e^{\hat{y}_k}}{\sum_{j=1}^{K} e^{\hat{y}_j}}$$
For the true class $k = c$, the loss is: $$l = -\log(p_c)$$
Gradient for class k: $$g_k = p_k - \mathbb{1}[k = c]$$
Hessian for class k (diagonal approximation): $$h_k = p_k(1 - p_k)$$
This is identical in form to binary logistic, applied per-class.
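As a minimal sketch of how these per-class statistics can be computed (the function and variable names here are illustrative, not XGBoost's internal API):

```python
import numpy as np

def softmax_grad_hess(logits, true_class):
    """Per-class gradient and diagonal Hessian for softmax cross-entropy.

    logits: array of shape (K,) with raw scores for one sample.
    true_class: integer index c of the correct class.
    Returns g_k = p_k - 1[k == c] and h_k = p_k * (1 - p_k) for each class k.
    """
    z = logits - np.max(logits)          # stabilize the exponentials
    p = np.exp(z) / np.sum(np.exp(z))    # softmax probabilities
    g = p.copy()
    g[true_class] -= 1.0                 # subtract the one-hot indicator
    h = p * (1.0 - p)                    # diagonal Hessian approximation
    return g, h

# One sample with K = 3 classes, true class c = 2
g, h = softmax_grad_hess(np.array([0.5, -1.0, 2.0]), true_class=2)
print("g:", np.round(g, 4))  # only the true class has a negative gradient
print("h:", np.round(h, 4))  # curvature is largest for mid-range probabilities
```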
4. Absolute Error (Robust Regression)
$$l(y, \hat{y}) = |y - \hat{y}|$$
Gradient: $$g = -\text{sign}(y - \hat{y})$$
Hessian: $$h = 0 \text{ (undefined at } y = \hat{y}\text{)}$$
This is problematic for XGBoost's second-order method! In practice, XGBoost uses the Huber loss or a smoothed approximation instead.
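One commonly used smooth surrogate is the pseudo-Huber loss, which behaves like squared error near zero and like absolute error far from zero while keeping a strictly positive Hessian everywhere. The sketch below illustrates its gradient and Hessian; it is an illustration of the idea, not XGBoost's internal implementation:

```python
import numpy as np

def pseudo_huber_grad_hess(y_true, y_pred, delta=1.0):
    """Gradient and Hessian of the pseudo-Huber loss, a smooth Huber-like surrogate.

    L = delta^2 * (sqrt(1 + (r/delta)^2) - 1), with residual r = y_pred - y_true.
    Unlike absolute error, the Hessian is strictly positive everywhere.
    """
    r = y_pred - y_true
    scale = np.sqrt(1.0 + (r / delta) ** 2)
    g = r / scale          # ≈ r for small residuals, ≈ ±delta for large ones
    h = 1.0 / scale ** 3   # ≈ 1 near zero, decays smoothly for outliers
    return g, h

g, h = pseudo_huber_grad_hess(np.array([0.0, 0.0, 0.0]), np.array([0.1, 1.0, 10.0]))
print("g:", np.round(g, 4))  # gradient saturates instead of growing linearly
print("h:", np.round(h, 4))  # Hessian stays positive, so Newton steps are well-defined
```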
| Loss Function | Use Case | Gradient (g) | Hessian (h) |
|---|---|---|---|
| Squared Error | Regression | $\hat{y} - y$ | $1$ |
| Logistic | Binary Classification | $p - y$ | $p(1-p)$ |
| Softmax | Multi-class | $p_k - \mathbb{1}_{k=c}$ | $p_k(1-p_k)$ |
| Huber | Robust Regression | $\hat{y} - y$ or $\pm\delta$ | $1$ or $0$ |
| Poisson | Count Data | $e^{\hat{y}} - y$ | $e^{\hat{y}}$ |
```python
import numpy as np

class GradientHessian:
    """Compute gradients and Hessians for common loss functions."""

    @staticmethod
    def squared_error(y_true, y_pred):
        """
        Squared Error Loss: L = 0.5 * (y - ŷ)²
        g = ŷ - y
        h = 1
        """
        g = y_pred - y_true
        h = np.ones_like(y_pred)
        return g, h

    @staticmethod
    def logistic(y_true, y_pred_logit):
        """
        Logistic Loss: L = -[y·log(p) + (1-y)·log(1-p)] where p = sigmoid(ŷ)
        g = p - y
        h = p(1 - p)
        """
        p = 1 / (1 + np.exp(-y_pred_logit))  # sigmoid
        g = p - y_true
        h = p * (1 - p)
        # Clip h to avoid numerical issues
        h = np.clip(h, 1e-6, 1.0)
        return g, h

    @staticmethod
    def huber(y_true, y_pred, delta=1.0):
        """
        Huber Loss: Quadratic for small errors, linear for large.
        L = 0.5*(y - ŷ)²        if |y - ŷ| ≤ δ
            δ*|y - ŷ| - 0.5*δ²  otherwise
        g = ŷ - y if |residual| ≤ δ, else sign(residual)*δ
        h = 1 if |residual| ≤ δ, else 0 (or small ε)
        """
        residual = y_pred - y_true
        abs_residual = np.abs(residual)
        # Gradient
        g = np.where(abs_residual <= delta, residual, delta * np.sign(residual))
        # Hessian (use small epsilon for numerical stability)
        h = np.where(abs_residual <= delta, 1.0, 1e-6)
        return g, h

    @staticmethod
    def poisson(y_true, y_pred_log):
        """
        Poisson Loss for count data: L = exp(ŷ) - y·ŷ
        where ŷ is the log of the expected count.
        g = exp(ŷ) - y
        h = exp(ŷ)
        """
        exp_pred = np.exp(y_pred_log)
        g = exp_pred - y_true
        h = exp_pred
        return g, h

# Demonstration
print("Gradient and Hessian Computation Examples")
print("=" * 60)

# Example 1: Squared Error
print("\n1. Squared Error (Regression)")
print("-" * 40)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 1.8, 3.5, 3.7])
g, h = GradientHessian.squared_error(y, y_pred)
for i in range(len(y)):
    print(f"  Sample {i}: y={y[i]:.1f}, ŷ={y_pred[i]:.1f}, "
          f"g={g[i]:+.3f}, h={h[i]:.3f}")

# Example 2: Logistic Loss
print("\n2. Logistic Loss (Binary Classification)")
print("-" * 40)
y = np.array([0, 0, 1, 1])
logits = np.array([-2.0, -0.5, 0.5, 2.0])
probs = 1 / (1 + np.exp(-logits))
g, h = GradientHessian.logistic(y, logits)
for i in range(len(y)):
    print(f"  Sample {i}: y={y[i]}, logit={logits[i]:+.1f}, "
          f"p={probs[i]:.3f}, g={g[i]:+.3f}, h={h[i]:.3f}")

# Example 3: Connection to optimal leaf weight
print("\n3. Computing Optimal Leaf Weight")
print("-" * 40)
# Imagine all 4 samples land in the same leaf
G = np.sum(g)   # Sum of gradients
H = np.sum(h)   # Sum of Hessians
lambda_ = 1.0   # L2 regularization

w_optimal = -G / (H + lambda_)
print(f"  Sum of gradients (G): {G:.4f}")
print(f"  Sum of Hessians (H): {H:.4f}")
print(f"  Regularization (λ): {lambda_}")
print(f"  Optimal leaf weight: w* = -G/(H+λ) = {w_optimal:.4f}")
```

Now we connect the Taylor approximation to XGBoost's tree construction algorithm.
Reformulating with Tree Structure
Recall that a tree $f(x)$ maps inputs to leaf weights: $$f(x) = w_{q(x)}$$
where $q(x)$ assigns sample $x$ to a leaf index.
Let $I_j = \{i : q(x_i) = j\}$ be the set of samples in leaf $j$. The approximated objective becomes:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$
Defining the aggregate statistics $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ for each leaf, we get:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T$$
Optimal Leaf Weights
For a fixed tree structure (fixed $q$), this is a separable quadratic optimization in the leaf weights. Taking the derivative with respect to $w_j$ and setting to zero:
$$\frac{\partial \tilde{\mathcal{L}}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0$$
Solving: $$w_j^* = -\frac{G_j}{H_j + \lambda}$$
This is the closed-form optimal weight for leaf $j$—no iterative optimization needed!
Optimal Objective Value
Substituting $w_j^*$ back into the objective:
$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
This expression serves as a quality score for the tree structure $q$. Lower values are better (more negative first term means better fit; more leaves add penalty via $\gamma T$).
In the optimal weight formula w* = -G/(H+λ), the Hessian H appears in the denominator. When H is large (high curvature), the weight is smaller—the algorithm takes conservative steps. When H is small (flat region), larger weights are allowed. This is exactly the Newton's method behavior: scale the gradient by the inverse curvature.
```python
import numpy as np

def compute_tree_score(leaf_data, lambda_, gamma):
    """
    Compute the objective value for a tree structure.

    Parameters:
    -----------
    leaf_data : list of (G_j, H_j) tuples for each leaf
    lambda_   : L2 regularization parameter
    gamma     : Leaf count penalty

    Returns:
    --------
    score : The objective value (lower is better)
    optimal_weights : Optimal weight for each leaf
    """
    T = len(leaf_data)  # Number of leaves
    optimal_weights = []
    score_sum = 0

    for G_j, H_j in leaf_data:
        # Optimal weight for this leaf
        w_j = -G_j / (H_j + lambda_)
        optimal_weights.append(w_j)
        # Contribution to objective
        score_sum += (G_j ** 2) / (H_j + lambda_)

    # Full objective
    score = -0.5 * score_sum + gamma * T
    return score, optimal_weights

# Example: Compare two tree structures
print("Tree Structure Optimization via Second-Order Approximation")
print("=" * 65)

# Regularization parameters
lambda_ = 1.0
gamma = 0.5

# Tree A: Simple (2 leaves)
# Leaf 1: samples with G=5, H=10
# Leaf 2: samples with G=-3, H=8
tree_A_leaves = [(5.0, 10.0), (-3.0, 8.0)]
score_A, weights_A = compute_tree_score(tree_A_leaves, lambda_, gamma)

print(f"\nTree A (2 leaves):")
print(f"  Leaf 1: G={tree_A_leaves[0][0]}, H={tree_A_leaves[0][1]} → w*={weights_A[0]:.4f}")
print(f"  Leaf 2: G={tree_A_leaves[1][0]}, H={tree_A_leaves[1][1]} → w*={weights_A[1]:.4f}")
print(f"  Objective score: {score_A:.4f}")

# Tree B: Complex (4 leaves) - Tree A's leaves split further
# Leaf 1a: G=3, H=6
# Leaf 1b: G=2, H=4
# Leaf 2a: G=-2, H=5
# Leaf 2b: G=-1, H=3
tree_B_leaves = [(3.0, 6.0), (2.0, 4.0), (-2.0, 5.0), (-1.0, 3.0)]
score_B, weights_B = compute_tree_score(tree_B_leaves, lambda_, gamma)

print(f"\nTree B (4 leaves) - further splits of Tree A:")
for i, (G, H) in enumerate(tree_B_leaves):
    print(f"  Leaf {i+1}: G={G}, H={H} → w*={weights_B[i]:.4f}")
print(f"  Objective score: {score_B:.4f}")

# Comparison
print(f"\n{'=' * 65}")
print(f"Score comparison: Tree A = {score_A:.4f}, Tree B = {score_B:.4f}")
if score_A < score_B:
    print("Tree A is better (lower objective) - simpler is preferred!")
else:
    print("Tree B is better (lower objective) - additional splits worthwhile!")

# Demonstrate how gamma affects the decision
print(f"\n{'=' * 65}")
print("Effect of gamma on tree preference:")
for gamma_test in [0, 0.5, 1.0, 2.0, 5.0]:
    score_A_test, _ = compute_tree_score(tree_A_leaves, lambda_, gamma_test)
    score_B_test, _ = compute_tree_score(tree_B_leaves, lambda_, gamma_test)
    better = "A" if score_A_test < score_B_test else "B"
    print(f"  γ = {gamma_test:.1f}: Score A = {score_A_test:.4f}, "
          f"Score B = {score_B_test:.4f} → Tree {better} preferred")
```

The split gain formula is central to XGBoost's greedy tree construction. Let's derive it rigorously from the quadratic approximation.
The Split Decision
Consider a leaf with aggregate statistics $(G, H)$ containing samples from set $I$. We want to determine if splitting this leaf into left ($I_L$) and right ($I_R$) children improves the objective.
Before Split (Single Leaf)
Objective contribution from this leaf: $$\text{Score}_{\text{before}} = -\frac{1}{2} \frac{G^2}{H + \lambda}$$
The structural penalty for this single leaf ($\gamma \times 1$) is already included in the total objective.
After Split (Two Leaves)
After splitting, the left child contains the samples in $I_L$ with statistics $(G_L, H_L)$, and the right child contains the samples in $I_R$ with statistics $(G_R, H_R)$.
Note: $G_L + G_R = G$ and $H_L + H_R = H$ (samples partition exactly).
Objective contribution: $$\text{Score}_{\text{after}} = -\frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right)$$
Plus one additional leaf penalty: $+\gamma$ (we went from 1 to 2 leaves).
The Gain Formula
The gain from the split is the reduction in objective (before - after, accounting for signs):
$$\text{Gain} = \text{Score}_{\text{before}} - \text{Score}_{\text{after}} - \gamma$$
$$\text{Gain} = -\frac{1}{2} \frac{G^2}{H + \lambda} - \left( -\frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right) \right) - \gamma$$
Simplifying:
$$\boxed{\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma}$$
Interpretation
Each fraction $\frac{G^2}{H + \lambda}$ can be thought of as the "information content" or "score" of a leaf. The split gain is half the amount by which the children's combined score exceeds the parent's score, minus the penalty $\gamma$ for adding one more leaf.
The split is made if and only if Gain > 0.
Traditional gradient boosting (first-order) fits trees to pseudo-residuals without this explicit gain formula. XGBoost's second-order formulation provides a rigorous criterion that accounts for sample curvature—splits in regions with high Hessian (more confidence) are valued differently than splits in low-Hessian regions.
```python
import numpy as np

def compute_split_gain(G_L, H_L, G_R, H_R, lambda_, gamma):
    """
    Compute the gain from splitting a leaf.

    Gain = 0.5 * [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ

    Returns: gain value and component breakdown
    """
    G = G_L + G_R
    H = H_L + H_R

    score_L = (G_L ** 2) / (H_L + lambda_)
    score_R = (G_R ** 2) / (H_R + lambda_)
    score_parent = (G ** 2) / (H + lambda_)

    gain = 0.5 * (score_L + score_R - score_parent) - gamma

    return gain, {
        'score_L': score_L,
        'score_R': score_R,
        'score_parent': score_parent,
        'raw_improvement': 0.5 * (score_L + score_R - score_parent),
        'penalty': gamma
    }

def find_best_split(g, h, feature_values, lambda_, gamma):
    """
    Find the best split point for a single feature.
    Uses exact greedy algorithm: try every possible split point.
    """
    # Sort by feature value
    sorted_idx = np.argsort(feature_values)
    g_sorted = g[sorted_idx]
    h_sorted = h[sorted_idx]

    n = len(g)
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_split_idx = -1

    # Cumulative sums for efficient computation
    G_left = 0
    H_left = 0

    gains = []

    # Try each split point
    for i in range(n - 1):
        G_left += g_sorted[i]
        H_left += h_sorted[i]
        G_right = G_total - G_left
        H_right = H_total - H_left

        gain, _ = compute_split_gain(G_left, H_left, G_right, H_right, lambda_, gamma)
        gains.append((feature_values[sorted_idx[i]], gain))

        if gain > best_gain:
            best_gain = gain
            best_split_idx = i
            best_split_value = (feature_values[sorted_idx[i]] + feature_values[sorted_idx[i+1]]) / 2

    return best_gain, best_split_value, gains

# Example: Find best split
print("Finding the Best Split Using Second-Order Gain")
print("=" * 60)

np.random.seed(42)

# Simulate regression data
n_samples = 10
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y_true = np.array([1.2, 1.5, 1.8, 2.5, 3.0, 6.0, 6.5, 7.0, 7.2, 7.5])
y_pred = np.full(n_samples, np.mean(y_true))  # Current: predict mean

# Compute gradients and Hessians (MSE loss)
g = y_pred - y_true     # gradient
h = np.ones(n_samples)  # Hessian for MSE

print("Data:")
print(f"{'Feature':>10} {'y_true':>10} {'y_pred':>10} {'g':>10} {'h':>10}")
print("-" * 50)
for i in range(n_samples):
    print(f"{feature[i]:>10.1f} {y_true[i]:>10.2f} {y_pred[i]:>10.2f} "
          f"{g[i]:>10.4f} {h[i]:>10.1f}")

lambda_ = 1.0
gamma = 0.5

best_gain, best_split, all_gains = find_best_split(g, h, feature, lambda_, gamma)

print(f"\nRegularization: λ = {lambda_}, γ = {gamma}")
print(f"\nGains for each possible split point:")
print("-" * 40)
for split_val, gain in all_gains:
    marker = " ← BEST" if abs(gain - best_gain) < 1e-6 else ""
    print(f"  Split at feature ≤ {split_val:.1f}: Gain = {gain:+.4f}{marker}")

print(f"\nBest split: feature ≤ {best_split:.1f} with gain = {best_gain:.4f}")

if best_gain > 0:
    print("✓ Split is beneficial (gain > 0)")
else:
    print("✗ Split is not beneficial (gain ≤ 0)")
```

XGBoost's second-order approximation has a deep connection to Newton's method in numerical optimization. Understanding this connection illuminates why XGBoost converges faster and achieves better solutions.
Newton's Method Recap
For minimizing a function $f(\theta)$, Newton's method uses the update:
$$\theta_{t+1} = \theta_t - \frac{f'(\theta_t)}{f''(\theta_t)}$$
This is equivalent to fitting a local quadratic approximation to $f$ at $\theta_t$ and jumping directly to its minimum.
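Written out, the equivalence is a one-line minimization of the local quadratic model:
$$\min_{\delta}\; f(\theta_t) + f'(\theta_t)\,\delta + \frac{1}{2} f''(\theta_t)\,\delta^2 \quad\Longrightarrow\quad \delta^* = -\frac{f'(\theta_t)}{f''(\theta_t)}$$
which is exactly the Newton update, and (with an added L2 term) the same minimization that produces XGBoost's leaf weights.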
XGBoost as Functional Newton's Method
In XGBoost, we're optimizing in function space rather than parameter space. The "parameter" is the prediction $\hat{y}$, and we're finding the best increment $f_t(x)$ to add.
For each sample, the optimal increment is:
$$f_t(x_i)^* = -\frac{g_i}{h_i}$$
But we can't set arbitrary per-sample values—we're constrained by tree structure. The tree aggregates samples in leaves, giving:
$$w_j^* = -\frac{G_j}{H_j + \lambda} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
This is exactly a regularized Newton step at the leaf level!
| Newton's Method | XGBoost |
|---|---|
| Parameter θ | Prediction ŷ (in function space) |
| First derivative f'(θ) | Gradient g = ∂l/∂ŷ |
| Second derivative f''(θ) | Hessian h = ∂²l/∂ŷ² |
| Update: -f'/f'' | Leaf weight: -G/(H+λ) |
| Convergence: quadratic | Fast convergence in practice |
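A minimal numerical sketch of this correspondence (illustrative values and variable names, not library code): each sample's unconstrained Newton step would be $-g_i/h_i$, but the tree must assign one shared, regularized step per leaf.

```python
import numpy as np

# Four samples that end up in the same leaf (logistic-loss-style statistics)
g = np.array([0.3, -0.4, 0.25, -0.1])   # per-sample gradients
h = np.array([0.21, 0.24, 0.19, 0.09])  # per-sample Hessians
lambda_ = 1.0

# Unconstrained per-sample Newton steps: each sample would move by -g_i / h_i
per_sample_steps = -g / h
print("Per-sample Newton steps:", np.round(per_sample_steps, 3))

# The leaf can only apply ONE shared weight, so the regularized aggregate
# step -G/(H+λ) is used instead: a Newton step at the leaf level.
G, H = g.sum(), h.sum()
w_leaf = -G / (H + lambda_)
print(f"Shared leaf weight w* = -G/(H+λ) = {w_leaf:.4f}")
```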
Advantages of the Newton-like Approach
Adaptive Step Sizes: Unlike gradient descent with a fixed learning rate, the Hessian provides curvature-aware step sizes. When the loss surface is flat (small h), larger steps are taken; when curved (large h), conservative steps are taken.
Faster Convergence: Newton's method has quadratic convergence near the optimum, while gradient descent has linear convergence. In practice, this means fewer boosting iterations are needed.
Scale Invariance: The ratio $g/h$ is scale-invariant in certain respects—if you scale your target variable, the optimal weights adjust appropriately.
Loss-Agnostic Framework: By computing $g$ and $h$ for any twice-differentiable loss, the same tree-building algorithm works for regression, classification, and custom objectives.
Traditional gradient boosting is like gradient descent: it uses only the gradient direction, requiring careful learning rate tuning. XGBoost is like Newton's method: it uses curvature to take appropriately-sized steps. This is why XGBoost often needs fewer trees and less tuning to achieve good results.
```python
import numpy as np
import matplotlib.pyplot as plt

def gradient_descent(gradient_fn, hessian_fn, x0, lr, n_iters):
    """Gradient Descent: x_{t+1} = x_t - lr * gradient(x_t)"""
    x = x0
    trajectory = [x]
    for _ in range(n_iters):
        x = x - lr * gradient_fn(x)
        trajectory.append(x)
    return np.array(trajectory)

def newton_method(gradient_fn, hessian_fn, x0, n_iters):
    """Newton's Method: x_{t+1} = x_t - gradient(x_t) / hessian(x_t)"""
    x = x0
    trajectory = [x]
    for _ in range(n_iters):
        g = gradient_fn(x)
        h = hessian_fn(x)
        x = x - g / h
        trajectory.append(x)
    return np.array(trajectory)

# Example: Minimize f(x) = x^2 (simple quadratic)
# Gradient: f'(x) = 2x
# Hessian:  f''(x) = 2

gradient_fn = lambda x: 2 * x
hessian_fn = lambda x: 2

x0 = 5.0  # Start far from optimum (x* = 0)
n_iters = 10

# Run both methods
gd_trajectory = gradient_descent(gradient_fn, hessian_fn, x0, lr=0.3, n_iters=n_iters)
newton_trajectory = newton_method(gradient_fn, hessian_fn, x0, n_iters=n_iters)

print("Comparison: Gradient Descent vs Newton's Method")
print("=" * 60)
print(f"Objective: f(x) = x², minimum at x* = 0")
print(f"Starting point: x0 = {x0}")
print()
print(f"{'Iteration':>10} {'GD (lr=0.3)':>15} {'Newton':>15}")
print("-" * 45)
for i in range(min(len(gd_trajectory), len(newton_trajectory))):
    print(f"{i:>10} {gd_trajectory[i]:>15.6f} {newton_trajectory[i]:>15.6f}")

print("\n" + "=" * 60)
print("Newton's method converges in 1 iteration for quadratics!")
print("This is analogous to XGBoost's per-leaf optimal weight.")

# For non-quadratic: f(x) = log(1 + exp(x)) (like logistic loss)
print("\n" + "=" * 60)
print("Non-quadratic example: f(x) = log(1 + exp(x))")
print("=" * 60)

softplus = lambda x: np.log(1 + np.exp(np.clip(x, -500, 500)))
softplus_grad = lambda x: 1 / (1 + np.exp(-x))              # sigmoid
softplus_hess = lambda x: (np.exp(x)) / (1 + np.exp(x))**2  # sigmoid * (1-sigmoid)

x0 = 3.0
n_iters = 15

gd_trajectory2 = gradient_descent(softplus_grad, softplus_hess, x0, lr=0.5, n_iters=n_iters)
newton_trajectory2 = newton_method(softplus_grad, softplus_hess, x0, n_iters=n_iters)

print(f"Starting point: x0 = {x0}")
print(f"Optimum at x* → -∞ (but plateaus)")
print()
print(f"{'Iter':>5} {'f(x) GD':>15} {'f(x) Newton':>15}")
print("-" * 40)
for i in range(min(8, len(gd_trajectory2), len(newton_trajectory2))):
    print(f"{i:>5} {softplus(gd_trajectory2[i]):>15.6f} "
          f"{softplus(newton_trajectory2[i]):>15.6f}")

print("\nNewton converges faster even for non-quadratic losses!")
```

Understanding the second-order approximation has practical implications for how you use XGBoost effectively.
1. Custom Loss Functions
To use a custom loss function in XGBoost, you must provide both gradient and Hessian functions:
```python
import xgboost as xgb
import numpy as np

def custom_huber_objective(y_pred, dtrain, delta=1.0):
    """
    Custom Huber loss objective for XGBoost.

    Huber loss is quadratic for small residuals, linear for large.
    This makes it robust to outliers.

    Must return (gradient, hessian) arrays.
    """
    y_true = dtrain.get_label()
    residual = y_pred - y_true
    abs_residual = np.abs(residual)

    # Gradient
    # g = residual               if |residual| <= delta
    # g = delta * sign(residual) if |residual| > delta
    grad = np.where(
        abs_residual <= delta,
        residual,
        delta * np.sign(residual)
    )

    # Hessian
    # h = 1                    if |residual| <= delta
    # h = 0 (or small epsilon) if |residual| > delta
    hess = np.where(
        abs_residual <= delta,
        np.ones_like(residual),
        np.full_like(residual, 1e-6)  # Small epsilon for numerical stability
    )

    return grad, hess

def custom_huber_eval(y_pred, dtrain, delta=1.0):
    """Evaluation metric for Huber loss."""
    y_true = dtrain.get_label()
    residual = np.abs(y_pred - y_true)
    loss = np.where(
        residual <= delta,
        0.5 * residual ** 2,
        delta * residual - 0.5 * delta ** 2
    )
    return 'huber_loss', float(np.mean(loss))

# Usage example
# X_train, y_train = ...  # Your data
# dtrain = xgb.DMatrix(X_train, label=y_train)

# model = xgb.train(
#     params={'tree_method': 'hist'},
#     dtrain=dtrain,
#     num_boost_round=100,
#     obj=lambda preds, dtrain: custom_huber_objective(preds, dtrain, delta=1.0),
#     feval=lambda preds, dtrain: custom_huber_eval(preds, dtrain, delta=1.0)
# )

print("Custom Loss Function Example: Huber Loss")
print("=" * 50)
print("To use custom losses in XGBoost, you must provide:")
print("  1. Gradient: ∂L/∂ŷ for each sample")
print("  2. Hessian: ∂²L/∂ŷ² for each sample")
print()
print("Common custom losses:")
print("  - Huber (robust regression)")
print("  - Focal (class imbalance)")
print("  - Quantile (prediction intervals)")
print("  - Custom business metric")
```

**2. Understanding Sample Weights Through Hessians**
For logistic loss, the Hessian $h_i = p_i(1 - p_i)$ acts as an implicit sample weight:
- Samples near the decision boundary ($p \approx 0.5$): high $h$, high influence
- Confident predictions ($p \approx 0$ or $p \approx 1$): low $h$, low influence
This means XGBoost automatically focuses on uncertain samples—similar to boosting's focus on hard examples, but arising naturally from the loss curvature.
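A small numerical illustration of this implicit weighting (assumed values, not library code): the optimal leaf weight can be read as a Hessian-weighted blend of the per-sample Newton steps $-g_i/h_i$, so high-$h$ (uncertain) samples dominate the result.

```python
import numpy as np

# Binary classification: same true label, different confidence levels
p = np.array([0.5, 0.9, 0.99])   # predicted probabilities for three samples with y = 1
y = np.ones(3)
g = p - y                         # logistic gradient
h = p * (1 - p)                   # logistic Hessian = implicit sample weight
lambda_ = 1.0

print("Per-sample h (implicit weight):", np.round(h, 4))

# The leaf weight is a Hessian-weighted blend of per-sample Newton steps -g_i/h_i:
# w* = sum_i h_i * (-g_i/h_i) / (sum_i h_i + lambda) = -G / (H + lambda)
per_sample_steps = -g / h
w_leaf = np.sum(h * per_sample_steps) / (np.sum(h) + lambda_)
print("Per-sample preferred steps:", np.round(per_sample_steps, 3))
print(f"Leaf weight (h-weighted blend): {w_leaf:.4f}")
print(f"Same as -G/(H+λ):               {-g.sum() / (h.sum() + lambda_):.4f}")
```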
**3. Minimum Hessian Sum (min_child_weight)**
XGBoost has a parameter `min_child_weight` (default=1) which requires:
$$H_{\text{child}} \geq \text{min_child_weight}$$
For a split to be valid, both children must have a sufficient Hessian sum. This prevents splits whose children carry too little total weight, which would produce unstable leaf weights and overfitting.
For MSE loss where $h=1$, min_child_weight=1 means each leaf needs at least 1 sample.
For logistic loss, the interpretation is more nuanced—it's about summed confidence, not sample count.
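As a minimal sketch of how this constraint plays out (illustrative code, not XGBoost's implementation): under logistic loss, a child full of confidently predicted samples can fail the check even though it contains several samples.

```python
import numpy as np

def split_allowed(h_left, h_right, min_child_weight=1.0):
    """Reject a candidate split if either child's Hessian sum is too small."""
    return h_left.sum() >= min_child_weight and h_right.sum() >= min_child_weight

# Logistic loss: h_i = p_i * (1 - p_i), so confident samples contribute little
p_left = np.array([0.97, 0.99, 0.98])    # very confident predictions
p_right = np.array([0.45, 0.55, 0.60])   # uncertain predictions
h_left, h_right = p_left * (1 - p_left), p_right * (1 - p_right)

print(f"H_left  = {h_left.sum():.3f}")   # small despite containing 3 samples
print(f"H_right = {h_right.sum():.3f}")
print("Split allowed:", split_allowed(h_left, h_right, min_child_weight=1.0))
```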
For the optimization to work correctly, Hessians must be positive (strictly, for convex losses). If you're implementing a custom loss and get NaN or unexpected behavior, check that your Hessian is always positive. A common fix is to use max(h, epsilon) for some small epsilon like 1e-6.
We have thoroughly explored XGBoost's second-order Taylor approximation—the mathematical technique that enables its efficient and accurate optimization.
**What's Next**
With the optimization framework understood, we'll explore XGBoost's **efficient split finding algorithms**. While the gain formula tells us how good a split is, actually finding the best split among all features and thresholds is computationally challenging. XGBoost introduces several innovations—including approximate algorithms and histogram-based methods—that make this tractable for large datasets.
You now understand XGBoost's second-order approximation—the Newton-like optimization that distinguishes it from traditional gradient boosting. This foundation in gradients and Hessians will be essential as we explore split finding, and you'll use these concepts whenever implementing custom loss functions.