Given the split gain formula from the previous page, the next challenge is finding the best split point among all possible features and thresholds. This is perhaps the most computationally demanding part of tree construction.
Consider a dataset with 1 million samples and 500 features. To find a single split exactly, we would need to evaluate every distinct value of every feature as a candidate threshold: up to 500 features × 1 million values ≈ 500 million candidate splits per node.
XGBoost's practical success depends on efficient split-finding algorithms that reduce this computational burden with little or no loss in split quality. This page explores these critical innovations.
By the end of this page, you will understand: (1) The exact greedy algorithm and its limitations, (2) Approximate split finding with quantile sketches, (3) Histogram-based splitting (the modern standard), (4) Weighted quantiles for second-order statistics, and (5) How to choose between splitting strategies based on your data.
Let's start with the conceptually simplest approach: the exact greedy algorithm. This serves as our baseline and helps understand what we're trying to optimize.
Algorithm Overview
For each node to split:
1. For each feature, sort the node's samples by that feature's value.
2. Scan the sorted samples left to right, maintaining running sums $G_L$ and $H_L$.
3. At each boundary between distinct values, compute the split gain.
4. Keep the (feature, threshold) pair with the highest gain.
Computational Analysis
For a single split: sorting costs $O(n \log n)$ per feature, giving $O(d \cdot n \log n)$ across $d$ features, plus an $O(d \cdot n)$ linear scan.
For a complete tree of depth $D$: every level must process all $n$ samples again, so the total is roughly $O(D \cdot d \cdot n \log n)$.
This is acceptable for small datasets but problematic for large ones.
```python
import numpy as np
from typing import Tuple, Optional

def exact_greedy_split(
    X: np.ndarray,
    g: np.ndarray,
    h: np.ndarray,
    lambda_: float = 1.0,
    gamma: float = 0.0,
    min_child_weight: float = 1.0
) -> Tuple[Optional[int], Optional[float], float]:
    """
    Exact greedy algorithm to find the best split.

    Parameters
    ----------
    X : array of shape (n_samples, n_features)
        Feature matrix
    g : array of shape (n_samples,)
        First-order gradients
    h : array of shape (n_samples,)
        Second-order gradients (Hessians)
    lambda_ : float
        L2 regularization parameter
    gamma : float
        Minimum split gain threshold
    min_child_weight : float
        Minimum sum of Hessians in each child

    Returns
    -------
    best_feature : int or None
        Index of best feature (None if no valid split)
    best_threshold : float or None
        Threshold value for split
    best_gain : float
        Gain of the best split
    """
    n_samples, n_features = X.shape
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Score of the current node (before split)
    score_before = (G_total ** 2) / (H_total + lambda_)

    # For each feature
    for feature_idx in range(n_features):
        # Sort samples by this feature
        sorted_indices = np.argsort(X[:, feature_idx])

        # Running sums for left child
        G_left = 0.0
        H_left = 0.0

        # Scan through sorted samples
        for i in range(n_samples - 1):
            idx = sorted_indices[i]
            G_left += g[idx]
            H_left += h[idx]

            G_right = G_total - G_left
            H_right = H_total - H_left

            # Check minimum child weight constraint
            if H_left < min_child_weight or H_right < min_child_weight:
                continue

            # Skip if this value equals the next (no split point between)
            if X[sorted_indices[i], feature_idx] == X[sorted_indices[i + 1], feature_idx]:
                continue

            # Compute gain
            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_before) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                # Threshold is midpoint between current and next value
                best_threshold = (X[sorted_indices[i], feature_idx] +
                                  X[sorted_indices[i + 1], feature_idx]) / 2

    if best_gain <= 0:
        return None, None, best_gain

    return best_feature, best_threshold, best_gain

# Demonstration
np.random.seed(42)

# Generate sample data
n_samples = 1000
n_features = 5
X = np.random.randn(n_samples, n_features)
y = 2 * X[:, 0] - 3 * X[:, 1] + np.random.randn(n_samples) * 0.5  # Linear relationship

# Simulate being at some point in boosting
y_pred = np.zeros(n_samples)  # Initial prediction
g = y_pred - y                # Gradient for MSE
h = np.ones(n_samples)        # Hessian for MSE

print("Exact Greedy Split Finding Algorithm")
print("=" * 55)
print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"Regularization: λ=1.0, γ=0.0")
print()

# Find best split
best_feat, best_thresh, best_gain = exact_greedy_split(X, g, h)

print(f"Best split found:")
print(f"  Feature index: {best_feat}")
print(f"  Threshold: {best_thresh:.4f}")
print(f"  Gain: {best_gain:.4f}")
print()

# Show complexity
print("Computational Complexity:")
print(f"  - Sorting: O({n_features} × {n_samples} × log({n_samples}))")
print(f"    = O({n_features * n_samples * int(np.log2(n_samples))})")
print(f"  - Scanning: O({n_features} × {n_samples})")
print(f"  Total operations: ~{n_features * n_samples * (int(np.log2(n_samples)) + 1)}")
```

The exact greedy algorithm requires sorting all samples for every feature at every split. For a dataset with 10 million samples and 1000 features, this becomes prohibitively expensive. Additionally, it requires keeping all data in memory, which is impossible for out-of-core (larger-than-RAM) datasets.
To scale to larger datasets, XGBoost introduced approximate split finding using quantile-based candidate generation.
The Key Insight
Instead of considering every unique value as a potential split point, we can:
1. Summarize each feature's distribution with a small set of quantiles.
2. Use only the quantile boundaries as candidate thresholds.
3. Evaluate the split gain at these candidates alone.
If we use $q$ quantiles per feature, we reduce candidate points from $O(n)$ to $O(q)$—typically $q \approx 256$ regardless of dataset size.
Algorithm: Approximate
1. For each feature k:
- Compute percentiles at 1/q, 2/q, ..., (q-1)/q
- These become candidate split points: {s_k1, s_k2, ..., s_{k,q-1}}
2. For each tree node to split:
- For each feature k and candidate threshold s_kj:
- Partition samples into left (x_k ≤ s_kj) and right
- Compute G_L, H_L, G_R, H_R
- Calculate split gain
- Apply the best (feature, threshold) combination found
Two Variants: Global vs Local
Global Quantiles: candidate thresholds are computed once per feature before tree construction and reused for every node. This is cheaper, but the candidates may be poorly placed for deep nodes that see only a skewed subset of the data.
Local Quantiles: candidates are recomputed from the samples reaching each node. This adapts better at deeper nodes, but requires a fresh quantile computation at every split.
Modern XGBoost uses a hybrid: global quantiles with enough candidates that local adaptation matters less.
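The global-versus-local distinction is easy to see in a small sketch (illustrative only; the coarse `n_bins = 16` is chosen deliberately so the divergence between the two candidate sets is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)  # a skewed feature column

# Global candidates: computed once on the full column, reused at every node
global_cand = np.percentile(x, np.linspace(0, 100, 17)[1:-1])

# Local candidates: recomputed on the subset of samples reaching one node
node_mask = x > np.median(x)      # pretend a parent split sent these samples right
local_cand = np.percentile(x[node_mask], np.linspace(0, 100, 17)[1:-1])

# Roughly half of the global candidates fall below the node's minimum value,
# so they are wasted at this node; the local set spreads over the actual range.
print("Global:", np.round(global_cand, 3))
print("Local: ", np.round(local_cand, 3))
```

With many bins (e.g. 256), the global set is dense enough that some candidates always land near any node's useful range, which is why the hybrid approach works well in practice.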
| Aspect | Exact Greedy | Approximate (Quantile) |
|---|---|---|
| Candidate split points | All unique values | Quantile boundaries |
| Candidates per feature | O(n) | O(q), typically 256 |
| Memory | Must fit in RAM | Streaming possible |
| Split quality | Optimal | Near-optimal |
| Use case | Small to medium data | Large scale data |
```python
import numpy as np
from typing import List, Tuple

def compute_quantile_candidates(X: np.ndarray, n_bins: int = 256) -> List[np.ndarray]:
    """
    Compute quantile-based candidate split points for each feature.

    Parameters
    ----------
    X : array of shape (n_samples, n_features)
    n_bins : int
        Number of bins (candidates = n_bins - 1)

    Returns
    -------
    candidates : list of arrays
        Candidate threshold values for each feature
    """
    n_samples, n_features = X.shape
    candidates = []

    for feature_idx in range(n_features):
        feature_values = X[:, feature_idx]

        # Compute percentiles
        percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]  # Exclude 0 and 100
        thresholds = np.percentile(feature_values, percentiles)

        # Remove duplicates
        thresholds = np.unique(thresholds)
        candidates.append(thresholds)

    return candidates

def approximate_split(
    X: np.ndarray,
    g: np.ndarray,
    h: np.ndarray,
    candidates: List[np.ndarray],
    lambda_: float = 1.0,
    gamma: float = 0.0
) -> Tuple[int, float, float]:
    """
    Approximate split finding using pre-computed quantile candidates.
    """
    n_samples, n_features = X.shape
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_feature = -1
    best_threshold = 0.0

    score_parent = (G_total ** 2) / (H_total + lambda_)

    for feature_idx in range(n_features):
        feature_candidates = candidates[feature_idx]
        feature_values = X[:, feature_idx]

        for threshold in feature_candidates:
            # Split samples
            left_mask = feature_values <= threshold
            G_left = np.sum(g[left_mask])
            H_left = np.sum(h[left_mask])
            G_right = G_total - G_left
            H_right = H_total - H_left

            if H_left < 1.0 or H_right < 1.0:
                continue

            # Compute gain
            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_parent) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain

# Demonstration: Compare exact vs approximate
np.random.seed(42)
n_samples = 10000
n_features = 10
X = np.random.randn(n_samples, n_features)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + np.random.randn(n_samples) * 0.1

g = -y  # Simplified gradients
h = np.ones(n_samples)

print("Approximate vs Exact Split Finding")
print("=" * 60)
print(f"Dataset: {n_samples} samples, {n_features} features")
print()

# Exact (for comparison)
import time
start = time.time()
exact_feat, exact_thresh, exact_gain = exact_greedy_split(X, g, h)
exact_time = time.time() - start
print(f"Exact Greedy:")
print(f"  Best feature: {exact_feat}, threshold: {exact_thresh:.4f}")
print(f"  Gain: {exact_gain:.4f}")
print(f"  Time: {exact_time:.4f}s")
print()

# Approximate with different numbers of bins
for n_bins in [16, 64, 256]:
    candidates = compute_quantile_candidates(X, n_bins)

    start = time.time()
    approx_feat, approx_thresh, approx_gain = approximate_split(X, g, h, candidates)
    approx_time = time.time() - start

    total_candidates = sum(len(c) for c in candidates)
    gain_ratio = approx_gain / exact_gain if exact_gain != 0 else 1.0

    print(f"Approximate ({n_bins} bins, {total_candidates} total candidates):")
    print(f"  Best feature: {approx_feat}, threshold: {approx_thresh:.4f}")
    print(f"  Gain: {approx_gain:.4f} ({gain_ratio:.1%} of exact)")
    print(f"  Time: {approx_time:.4f}s ({exact_time/approx_time:.1f}x speedup)")
    print()
```

Standard quantile methods treat all samples equally. But XGBoost's second-order approximation introduces a key insight: samples should be weighted by their Hessians.
Why Hessian Weighting?
Recall the approximated objective: $$\tilde{\mathcal{L}} = \sum_i \left[ g_i f(x_i) + \frac{1}{2} h_i f(x_i)^2 \right]$$
This can be rewritten as: $$\tilde{\mathcal{L}} = \sum_i \frac{1}{2} h_i \left( f(x_i) - \left(-\frac{g_i}{h_i}\right) \right)^2 + \text{constant}$$
This is a weighted least squares problem where:
- each sample's regression target is $-g_i / h_i$, and
- each sample's weight is its Hessian $h_i$.
Therefore, when computing quantiles to find good split candidates, we should use Hessian-weighted quantiles!
Weighted Quantile Definition
For a feature with values $(x_1, ..., x_n)$ and corresponding weights $(h_1, ..., h_n)$, the $\phi$-quantile is the value $z$ such that:
$$\frac{\sum_{i: x_i < z} h_i}{\sum_{i=1}^n h_i} < \phi \leq \frac{\sum_{i: x_i \leq z} h_i}{\sum_{i=1}^n h_i}$$
In other words, $\phi$ fraction of the weighted mass falls below $z$.
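As a sanity check of this definition, here is a direct (naive) implementation that returns the smallest $z$ whose cumulative weighted mass reaches $\phi$. The function name and toy data are ours for illustration, not XGBoost's:

```python
import numpy as np

def weighted_phi_quantile(x: np.ndarray, h: np.ndarray, phi: float) -> float:
    """Smallest z such that the h-weighted mass at or below z reaches phi."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    cum = np.cumsum(h_sorted) / h_sorted.sum()   # normalized weighted CDF
    return x_sorted[np.searchsorted(cum, phi)]

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.1, 0.1, 0.1, 0.7])  # most of the weighted mass sits on x = 4

# The weighted median is 4.0, even though the unweighted median lies between 2 and 3
print(weighted_phi_quantile(x, h, 0.5))  # 4.0
```

This is O(n log n) and stores everything, which is exactly what the sketch algorithms below avoid.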
The Quantile Sketch Problem
For streaming or distributed data, we cannot store all values. XGBoost uses quantile sketching algorithms that:
- maintain a compact summary of the (weighted) value distribution in bounded memory,
- answer quantile queries within a provable approximation error $\varepsilon$, and
- support merging summaries computed on different data partitions.
The XGBoost paper introduces a weighted quantile sketch for this purpose: a data structure in the spirit of the GK (Greenwald–Khanna) sketch, extended with provable guarantees for Hessian-weighted samples.
For classification with logistic loss, samples near the decision boundary (p ≈ 0.5) have higher Hessians h = p(1-p). Weighted quantiles ensure more candidate split points in these uncertain regions, where fine-grained decisions matter most. Samples with confident predictions (low h) get fewer candidates, which is appropriate since their contribution to the objective is smaller.
```python
import numpy as np

def weighted_quantile(values: np.ndarray,
                      weights: np.ndarray,
                      percentiles: np.ndarray) -> np.ndarray:
    """
    Compute weighted quantiles.

    Parameters
    ----------
    values : array of shape (n,)
        Feature values
    weights : array of shape (n,)
        Weights (Hessians in XGBoost context)
    percentiles : array
        Percentiles to compute (0-100 scale)

    Returns
    -------
    quantiles : array
        Weighted quantile values
    """
    # Sort by values
    sorted_idx = np.argsort(values)
    sorted_values = values[sorted_idx]
    sorted_weights = weights[sorted_idx]

    # Cumulative weight sum
    cumsum = np.cumsum(sorted_weights)
    total = cumsum[-1]

    # Normalize to 0-100 scale
    cumsum_pct = (cumsum / total) * 100

    # Find quantiles
    quantiles = np.interp(percentiles, cumsum_pct, sorted_values)
    return quantiles

def compare_weighted_unweighted_quantiles():
    """
    Demonstrate the difference between weighted and unweighted quantiles.
    """
    np.random.seed(42)

    # Simulate binary classification scenario
    n = 1000

    # Feature values (predictive of class)
    class_0 = np.random.randn(n // 2) - 1  # Class 0 samples
    class_1 = np.random.randn(n // 2) + 1  # Class 1 samples
    X = np.concatenate([class_0, class_1])

    # Simulate predictions from previous round
    # Samples far from boundary have extreme predictions
    logits = X * 0.5  # Correlated with feature
    probs = 1 / (1 + np.exp(-logits))

    # Hessians for logistic loss: h = p(1-p)
    h = probs * (1 - probs)

    print("Weighted vs Unweighted Quantile Comparison")
    print("=" * 60)
    print(f"Scenario: Binary classification with {n} samples")
    print()

    # Compute quantile candidates
    percentiles = np.array([10, 25, 50, 75, 90])

    # Unweighted (standard) quantiles
    unweighted = np.percentile(X, percentiles)

    # Weighted quantiles (by Hessian)
    weighted = weighted_quantile(X, h, percentiles)

    print("Quantile candidates (split thresholds):")
    print("-" * 60)
    print(f"{'Percentile':>12} {'Unweighted':>15} {'Weighted (by h)':>18}")
    print("-" * 60)
    for p, uw, w in zip(percentiles, unweighted, weighted):
        print(f"{p:>12}% {uw:>15.4f} {w:>18.4f}")

    print()
    print("Analysis:")
    print("-" * 60)

    # Show Hessian distribution
    boundary_mask = np.abs(X) < 1  # Near decision boundary
    h_boundary = np.mean(h[boundary_mask])
    h_extreme = np.mean(h[~boundary_mask])

    print(f"Mean Hessian near boundary (|X| < 1): {h_boundary:.4f}")
    print(f"Mean Hessian away from boundary:      {h_extreme:.4f}")
    print(f"Ratio: {h_boundary/h_extreme:.2f}x")
    print()
    print("Weighted quantiles shift toward the boundary region where")
    print("samples have higher Hessians (more uncertain predictions).")
    print("This places more candidate split points where they matter most!")

compare_weighted_unweighted_quantiles()
```

The state-of-the-art in XGBoost split finding is histogram-based methods. This approach, enabled by the tree_method='hist' parameter (now the default), provides dramatic speedups.
Core Idea
Instead of considering continuous feature values, we:
1. Discretize each feature into a fixed number of bins (typically 256) once, before training.
2. Accumulate gradient and Hessian sums per bin rather than per sample.
3. Scan bin boundaries, instead of raw sample values, as candidate split points.
The Histogram Trick
For each feature and each leaf node, we maintain a histogram:
histogram[bin_id] = (sum_of_g_in_bin, sum_of_h_in_bin)
Building the histogram takes $O(n)$ (one pass through samples). Finding the best split takes $O(\text{bins})$ (scan the histogram).
Since $\text{bins} \ll n$ (typically 256 vs millions), this is a massive speedup!
Histogram Subtraction Trick
An additional optimization: for binary trees, we only need to build the histogram for one child explicitly. The other child's histogram is:
$$\text{histogram}_{\text{sibling}} = \text{histogram}_{\text{parent}} - \text{histogram}_{\text{child}}$$
This halves the histogram construction time.
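A minimal sketch of the subtraction trick (illustrative only, not XGBoost's internal layout): build the parent histogram and one child's histogram explicitly, then derive the sibling's by elementwise subtraction.

```python
import numpy as np

n_bins = 8
rng = np.random.default_rng(1)
bins = rng.integers(0, n_bins, size=1000)  # pre-binned feature values for one node
g = rng.normal(size=1000)                  # per-sample gradients

# Parent histogram, built over all samples in the node
G_parent = np.zeros(n_bins)
np.add.at(G_parent, bins, g)

# Build only the (say) left child's histogram explicitly...
left = bins < 3  # stand-in for whatever mask the chosen split produced
G_left = np.zeros(n_bins)
np.add.at(G_left, bins[left], g[left])

# ...and obtain the right child's histogram for free, without touching the data
G_right = G_parent - G_left
```

The same identity holds for the Hessian histogram, so each level of the tree needs explicit histogram construction for only one of each pair of siblings.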
Memory Efficiency
Histograms use fixed memory regardless of dataset size:
- storage per node is (number of features) × (number of bins) × two accumulators (the G and H sums),
- this does not grow with $n$, so a node's statistics fit comfortably in CPU cache.
This contrasts with exact methods that need sample-level storage.
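A quick back-of-envelope calculation, assuming 256 bins and one 8-byte double each for the gradient and Hessian sums:

```python
n_features = 500
n_bins = 256
bytes_per_bin = 2 * 8  # one double each for sum-of-g and sum-of-h

# Per-node histogram memory is independent of the number of samples:
# the same ~2 MiB holds whether n is ten thousand or a billion.
per_node = n_features * n_bins * bytes_per_bin
print(f"Histogram memory per node: {per_node / 1024:.0f} KiB")  # prints 2000 KiB
```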
| Operation | Exact Greedy | Histogram-Based |
|---|---|---|
| Build structure | O(d·n·log n) sort | O(n) bin mapping |
| Per-node split | O(d·n) scan | O(d·bins) scan |
| Memory per node | O(n) samples | O(d·bins) (fixed) |
| Cache efficiency | Poor (random access) | Excellent |
| Parallelization | Limited | Highly parallel |
```python
import numpy as np
from typing import Tuple, List

class HistogramSplitFinder:
    """
    Histogram-based split finding for gradient boosting.
    """

    def __init__(self, n_bins: int = 256):
        self.n_bins = n_bins
        self.bin_edges = None  # Per-feature bin edges

    def fit_bin_edges(self, X: np.ndarray):
        """
        Compute bin edges for each feature (done once before training).
        """
        n_features = X.shape[1]
        self.bin_edges = []

        for j in range(n_features):
            # Compute percentile-based bin edges
            edges = np.percentile(X[:, j], np.linspace(0, 100, self.n_bins + 1))
            # Ensure uniqueness
            edges = np.unique(edges)
            self.bin_edges.append(edges)

    def bin_data(self, X: np.ndarray) -> np.ndarray:
        """
        Convert continuous features to bin indices.
        """
        n_samples, n_features = X.shape
        binned = np.zeros((n_samples, n_features), dtype=np.uint8)

        for j in range(n_features):
            binned[:, j] = np.digitize(X[:, j], self.bin_edges[j][1:-1])

        return binned

    def build_histogram(self, binned_data: np.ndarray, g: np.ndarray,
                        h: np.ndarray, feature_idx: int) -> Tuple[np.ndarray, np.ndarray]:
        """
        Build histogram for a single feature.
        Returns (G_per_bin, H_per_bin).
        """
        n_bins = len(self.bin_edges[feature_idx]) - 1
        G_hist = np.zeros(n_bins)
        H_hist = np.zeros(n_bins)

        # Aggregate gradients into bins
        bins = binned_data[:, feature_idx]
        np.add.at(G_hist, bins.astype(int), g)
        np.add.at(H_hist, bins.astype(int), h)

        return G_hist, H_hist

    def find_best_split_from_histogram(
        self,
        G_hist: np.ndarray,
        H_hist: np.ndarray,
        feature_idx: int,
        lambda_: float = 1.0,
        gamma: float = 0.0
    ) -> Tuple[float, float]:
        """
        Find best split by scanning histogram.
        """
        G_total = np.sum(G_hist)
        H_total = np.sum(H_hist)
        score_parent = (G_total ** 2) / (H_total + lambda_)

        best_gain = -np.inf
        best_threshold = None

        G_left = 0.0
        H_left = 0.0

        bin_edges = self.bin_edges[feature_idx]

        for bin_idx in range(len(G_hist) - 1):
            G_left += G_hist[bin_idx]
            H_left += H_hist[bin_idx]
            G_right = G_total - G_left
            H_right = H_total - H_left

            if H_left < 1.0 or H_right < 1.0:
                continue

            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_parent) - gamma

            if gain > best_gain:
                best_gain = gain
                # Threshold is the right edge of current bin
                best_threshold = bin_edges[bin_idx + 1]

        return best_threshold, best_gain

    def find_best_split(
        self,
        X_binned: np.ndarray,
        g: np.ndarray,
        h: np.ndarray,
        lambda_: float = 1.0,
        gamma: float = 0.0
    ) -> Tuple[int, float, float]:
        """
        Find the best split across all features using histograms.
        """
        n_features = X_binned.shape[1]

        best_gain = -np.inf
        best_feature = -1
        best_threshold = None

        for j in range(n_features):
            # Build histogram for this feature
            G_hist, H_hist = self.build_histogram(X_binned, g, h, j)

            # Find best split from histogram
            threshold, gain = self.find_best_split_from_histogram(
                G_hist, H_hist, j, lambda_, gamma
            )

            if gain > best_gain:
                best_gain = gain
                best_feature = j
                best_threshold = threshold

        return best_feature, best_threshold, best_gain

# Demonstration
np.random.seed(42)
n_samples = 100000  # Large dataset
n_features = 50

X = np.random.randn(n_samples, n_features)
y = np.sin(X[:, 0] * 2) + 0.5 * X[:, 1] + np.random.randn(n_samples) * 0.1
g = -y
h = np.ones(n_samples)

print("Histogram-Based Split Finding")
print("=" * 60)
print(f"Dataset: {n_samples:,} samples, {n_features} features")
print()

# Initialize and fit histogram finder
hist_finder = HistogramSplitFinder(n_bins=256)
hist_finder.fit_bin_edges(X)

import time

# Bin the data (done once)
start = time.time()
X_binned = hist_finder.bin_data(X)
bin_time = time.time() - start
print(f"Binning time: {bin_time:.4f}s (one-time preprocessing)")

# Find split using histograms
start = time.time()
feature, threshold, gain = hist_finder.find_best_split(X_binned, g, h)
split_time = time.time() - start

print(f"Split finding time: {split_time:.4f}s")
print(f"Best split: feature {feature}, threshold {threshold:.4f}, gain {gain:.4f}")
print()

# Compare with exact (on smaller subset for fairness)
print("Comparison with Exact Greedy (on 10K sample subset):")
X_small = X[:10000]
g_small = g[:10000]
h_small = h[:10000]

start = time.time()
exact_feat, exact_thresh, exact_gain = exact_greedy_split(X_small, g_small, h_small)
exact_time = time.time() - start
print(f"  Exact on 10K samples:      {exact_time:.4f}s")
print(f"  Histogram on 100K samples: {split_time:.4f}s")
print(f"  Histogram processes 10x more data in similar time!")
```

XGBoost exposes the split-finding algorithm through the tree_method parameter. Understanding this parameter helps you choose the right algorithm for your use case.
Available Options
| Value | Algorithm | Memory | Speed | Best For |
|---|---|---|---|---|
| exact | Exact greedy | High (in-memory) | Slow | Small datasets (<10K rows) |
| approx | Approximate (quantile) | Medium | Medium | Medium datasets |
| hist | Histogram-based | Low | Fast | Large datasets (default) |
| gpu_hist | Histogram on GPU | GPU memory | Very fast | GPU available, large data |
Recommendation
For most use cases, tree_method='hist' (now the default) is the best choice. It provides:
- training that is typically 5-20x faster than exact,
- low, fixed memory use per node (histograms rather than per-sample storage),
- split quality nearly indistinguishable from exact on most datasets.

Use tree_method='exact' only when:
- the dataset is small (roughly under 10K rows), or
- you need exact-split results for debugging or as a reference baseline.

Use tree_method='gpu_hist' when:
- a CUDA-capable GPU is available, and
- the dataset is large enough to amortize the transfer to GPU memory. (In XGBoost 2.0 and later, gpu_hist is deprecated; use tree_method='hist' together with device='cuda' instead.)
```python
import xgboost as xgb
import numpy as np
import time

# Generate sample data
np.random.seed(42)
n_samples = 50000
n_features = 100

X = np.random.randn(n_samples, n_features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Binary classification

dtrain = xgb.DMatrix(X, label=y)

# Common parameters
base_params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'learning_rate': 0.1,
    'nthread': -1,   # native API uses 'nthread'; 'n_jobs' belongs to the sklearn wrapper
    'verbosity': 0
}

print("XGBoost tree_method Comparison")
print("=" * 60)
print(f"Dataset: {n_samples:,} samples, {n_features} features")
print()

# Compare different tree methods
for tree_method in ['exact', 'approx', 'hist']:
    params = {**base_params, 'tree_method': tree_method}

    start = time.time()
    model = xgb.train(params, dtrain, num_boost_round=100)
    elapsed = time.time() - start

    print(f"tree_method='{tree_method}':")
    print(f"  Training time: {elapsed:.2f}s")
    print()

# The 'hist' method is usually 5-20x faster than 'exact'
# with virtually identical prediction quality

print("Practical Recommendations:")
print("-" * 60)
print("1. Use 'hist' (default) for most cases")
print("2. Use 'gpu_hist' if you have a GPU and large data")
print("3. Use 'exact' only for small data or debugging")
print("4. Use 'approx' rarely - 'hist' is usually better")
```

When using tree_method='hist', the max_bin parameter controls the number of histogram bins (default 256). Increasing it (e.g., 512) can slightly improve split quality at the cost of more memory and computation. Decreasing it (e.g., 64) saves memory for very large datasets. For most cases, 256 is a good balance.
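The bins-versus-resolution tradeoff behind max_bin can be illustrated without running XGBoost at all: with quantile binning, the best representable threshold can miss the ideal cut by up to roughly half a bin width. (A toy sketch; `true_cut` is an arbitrary illustrative value, not derived from any dataset.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)  # one feature column
true_cut = 0.337                # hypothetical "ideal" split point

for n_bins in (16, 64, 256):
    # Quantile bin edges, as a hist-style binning would compute them
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    # The closest threshold the binned representation can express
    err = np.abs(edges - true_cut).min()
    print(f"max_bin={n_bins:>3}: nearest representable threshold off by {err:.5f}")
```

The error shrinks roughly in proportion to 1/max_bin, which is why the default of 256 already leaves very little split quality on the table for most datasets.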
Beyond algorithmic complexity, XGBoost optimizes for cache efficiency—a crucial factor for modern CPU performance.
The Memory Hierarchy Problem
Modern CPUs have a memory hierarchy (sizes and latencies are rough, typical figures):
- L1 cache: ~32 KB, ~1 ns access
- L2 cache: ~256 KB-1 MB, ~4 ns
- L3 cache: ~8-64 MB, ~10-20 ns
- Main memory (RAM): gigabytes, ~100 ns
Accessing data not in cache triggers expensive memory fetches. For large datasets, random memory access patterns can slow algorithms by 10-100x.
XGBoost's Cache Optimizations
Block-Based Data Storage: data is held in in-memory blocks in compressed column (CSC) format, with each column pre-sorted by feature value. The sort is paid for once, before training begins.
Sorted Index Reuse: because the column blocks stay sorted, split finding at every node reuses the same sorted order with a linear scan; no per-node re-sorting is required.
Gradient Statistics Buffering: scanning a sorted column reads gradient statistics in a data-dependent, effectively random order. XGBoost's cache-aware variant prefetches these statistics into small thread-local buffers and accumulates them in mini-batches, cutting cache misses.
Histogram Method's Cache Advantage
The histogram method is particularly cache-friendly:
- binned data is stored as compact integer codes (e.g., uint8 for up to 256 bins), shrinking the working set,
- samples are read sequentially, a pattern the hardware prefetcher handles well,
- the histogram being updated is small enough to stay resident in L1/L2 cache.
This is why tree_method='hist' often provides speedups beyond what simple complexity analysis predicts.
tree_method='hist' has the best cache behavior

Many ML researchers focus only on algorithmic complexity (Big-O). But in practice, cache behavior can dominate runtime for large datasets: an O(n) algorithm with poor cache behavior can be slower than an O(n log n) algorithm with good cache behavior. XGBoost's attention to these details is part of why it is so fast.
We have explored the algorithms that make XGBoost's split finding efficient and scalable.
What's Next
With split finding covered, we'll explore XGBoost's sparsity-aware algorithm—how it efficiently handles missing values and sparse data. This is crucial for real-world datasets where missing values are common and categorical variables create sparse one-hot encodings.
You now understand XGBoost's efficient split finding algorithms—from exact greedy to histogram-based methods. These algorithms enable XGBoost to scale to datasets with millions of samples while maintaining high split quality. The histogram method, in particular, is a key innovation that makes XGBoost practical for large-scale machine learning.