Given the split gain formula from the previous page, the next challenge is finding the best split point among all possible features and thresholds. This is perhaps the most computationally demanding part of tree construction.
Consider a dataset with 1 million samples and 500 features. To find a single split exactly, we would need to evaluate every distinct value of every feature as a candidate threshold: up to 500 features × 1 million values ≈ 500 million candidate splits per node.
XGBoost's practical success depends on efficient split-finding algorithms that reduce this computational burden with little or no loss in split quality. This page explores these critical innovations.
By the end of this page, you will understand: (1) The exact greedy algorithm and its limitations, (2) Approximate split finding with quantile sketches, (3) Histogram-based splitting (the modern standard), (4) Weighted quantiles for second-order statistics, and (5) How to choose between splitting strategies based on your data.
Let's start with the conceptually simplest approach: the exact greedy algorithm. This serves as our baseline and helps understand what we're trying to optimize.
Algorithm Overview
For each node to split:
1. For each feature, sort the node's samples by that feature's value.
2. Scan the sorted samples left to right, maintaining running sums $G_L$ and $H_L$.
3. At each boundary between distinct values, compute the split gain.
4. Keep the (feature, threshold) pair with the highest gain.
Computational Analysis
For a single split: sorting costs $O(n \log n)$ per feature, giving $O(d \cdot n \log n)$ across $d$ features, plus an $O(d \cdot n)$ linear scan.
For a complete tree of depth $D$: every level must process all $n$ samples again, so the total is roughly $O(D \cdot d \cdot n \log n)$.
This is acceptable for small datasets but problematic for large ones.
```python
import numpy as np
from typing import Tuple, Optional

def exact_greedy_split(
    X: np.ndarray,
    g: np.ndarray,
    h: np.ndarray,
    lambda_: float = 1.0,
    gamma: float = 0.0,
    min_child_weight: float = 1.0
) -> Tuple[Optional[int], Optional[float], float]:
    """
    Exact greedy algorithm to find the best split.

    Parameters
    ----------
    X : array of shape (n_samples, n_features)
        Feature matrix
    g : array of shape (n_samples,)
        First-order gradients
    h : array of shape (n_samples,)
        Second-order gradients (Hessians)
    lambda_ : float
        L2 regularization parameter
    gamma : float
        Minimum split gain threshold
    min_child_weight : float
        Minimum sum of Hessians in each child

    Returns
    -------
    best_feature : int or None
        Index of best feature (None if no valid split)
    best_threshold : float or None
        Threshold value for split
    best_gain : float
        Gain of the best split
    """
    n_samples, n_features = X.shape
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    # Score of the current node (before split)
    score_before = (G_total ** 2) / (H_total + lambda_)

    # For each feature
    for feature_idx in range(n_features):
        # Sort samples by this feature
        sorted_indices = np.argsort(X[:, feature_idx])

        # Running sums for left child
        G_left = 0.0
        H_left = 0.0

        # Scan through sorted samples
        for i in range(n_samples - 1):
            idx = sorted_indices[i]
            G_left += g[idx]
            H_left += h[idx]

            G_right = G_total - G_left
            H_right = H_total - H_left

            # Check minimum child weight constraint
            if H_left < min_child_weight or H_right < min_child_weight:
                continue

            # Skip if this value equals the next (no split point between)
            if X[sorted_indices[i], feature_idx] == X[sorted_indices[i + 1], feature_idx]:
                continue

            # Compute gain
            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_before) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                # Threshold is midpoint between current and next value
                best_threshold = (X[sorted_indices[i], feature_idx] +
                                  X[sorted_indices[i + 1], feature_idx]) / 2

    if best_gain <= 0:
        return None, None, best_gain

    return best_feature, best_threshold, best_gain

# Demonstration
np.random.seed(42)

# Generate sample data
n_samples = 1000
n_features = 5
X = np.random.randn(n_samples, n_features)
y = 2 * X[:, 0] - 3 * X[:, 1] + np.random.randn(n_samples) * 0.5  # Linear relationship

# Simulate being at some point in boosting
y_pred = np.zeros(n_samples)  # Initial prediction
g = y_pred - y                # Gradient for MSE
h = np.ones(n_samples)        # Hessian for MSE

print("Exact Greedy Split Finding Algorithm")
print("=" * 55)
print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"Regularization: λ=1.0, γ=0.0")
print()

# Find best split
best_feat, best_thresh, best_gain = exact_greedy_split(X, g, h)

print(f"Best split found:")
print(f"  Feature index: {best_feat}")
print(f"  Threshold: {best_thresh:.4f}")
print(f"  Gain: {best_gain:.4f}")
print()

# Show complexity
print("Computational Complexity:")
print(f"  - Sorting: O({n_features} × {n_samples} × log({n_samples}))")
print(f"    = O({n_features * n_samples * int(np.log2(n_samples))})")
print(f"  - Scanning: O({n_features} × {n_samples})")
print(f"  Total operations: ~{n_features * n_samples * (int(np.log2(n_samples)) + 1)}")
```

The exact greedy algorithm requires sorting all samples for every feature at every split. For a dataset with 10 million samples and 1000 features, this becomes prohibitively expensive. Additionally, it requires keeping all data in memory, which is impossible for out-of-core (larger-than-RAM) datasets.
To scale to larger datasets, XGBoost introduced approximate split finding using quantile-based candidate generation.
The Key Insight
Instead of considering every unique value as a potential split point, we can:
1. Summarize each feature's distribution with a small set of quantiles.
2. Use only the quantile boundaries as candidate thresholds.
3. Evaluate the split gain at these candidates alone.
If we use $q$ quantiles per feature, we reduce candidate points from $O(n)$ to $O(q)$—typically $q \approx 256$ regardless of dataset size.
Algorithm: Approximate
1. For each feature k:
- Compute percentiles at 1/q, 2/q, ..., (q-1)/q
- These become candidate split points: {s_k1, s_k2, ..., s_{k,q-1}}
2. For each tree node to split:
- For each feature k and candidate threshold s_kj:
- Partition samples into left (x_k ≤ s_kj) and right
- Compute G_L, H_L, G_R, H_R
- Calculate split gain
- Apply the best (feature, threshold) combination found
Two Variants: Global vs Local
Global Quantiles: candidate thresholds are computed once per feature before tree construction and reused for every node. This is cheaper, but the candidates may be poorly placed for deep nodes that see only a skewed subset of the data.
Local Quantiles: candidates are recomputed from the samples reaching each node. This adapts better at deeper nodes, but requires a fresh quantile computation at every split.
Modern XGBoost uses a hybrid: global quantiles with enough candidates that local adaptation matters less.
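The global-versus-local distinction is easy to see in a small sketch (illustrative only; the coarse `n_bins = 16` is chosen deliberately so the divergence between the two candidate sets is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)  # a skewed feature column

# Global candidates: computed once on the full column, reused at every node
global_cand = np.percentile(x, np.linspace(0, 100, 17)[1:-1])

# Local candidates: recomputed on the subset of samples reaching one node
node_mask = x > np.median(x)      # pretend a parent split sent these samples right
local_cand = np.percentile(x[node_mask], np.linspace(0, 100, 17)[1:-1])

# Roughly half of the global candidates fall below the node's minimum value,
# so they are wasted at this node; the local set spreads over the actual range.
print("Global:", np.round(global_cand, 3))
print("Local: ", np.round(local_cand, 3))
```

With many bins (e.g. 256), the global set is dense enough that some candidates always land near any node's useful range, which is why the hybrid approach works well in practice.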
| Aspect | Exact Greedy | Approximate (Quantile) |
|---|---|---|
| Candidate split points | All unique values | Quantile boundaries |
| Candidates per feature | O(n) | O(q), typically 256 |
| Memory | Must fit in RAM | Streaming possible |
| Split quality | Optimal | Near-optimal |
| Use case | Small to medium data | Large scale data |
```python
import numpy as np
from typing import List, Tuple

def compute_quantile_candidates(X: np.ndarray, n_bins: int = 256) -> List[np.ndarray]:
    """
    Compute quantile-based candidate split points for each feature.

    Parameters
    ----------
    X : array of shape (n_samples, n_features)
    n_bins : int
        Number of bins (candidates = n_bins - 1)

    Returns
    -------
    candidates : list of arrays
        Candidate threshold values for each feature
    """
    n_samples, n_features = X.shape
    candidates = []

    for feature_idx in range(n_features):
        feature_values = X[:, feature_idx]

        # Compute percentiles
        percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]  # Exclude 0 and 100
        thresholds = np.percentile(feature_values, percentiles)

        # Remove duplicates
        thresholds = np.unique(thresholds)
        candidates.append(thresholds)

    return candidates

def approximate_split(
    X: np.ndarray,
    g: np.ndarray,
    h: np.ndarray,
    candidates: List[np.ndarray],
    lambda_: float = 1.0,
    gamma: float = 0.0
) -> Tuple[int, float, float]:
    """
    Approximate split finding using pre-computed quantile candidates.
    """
    n_samples, n_features = X.shape
    G_total = np.sum(g)
    H_total = np.sum(h)

    best_gain = -np.inf
    best_feature = -1
    best_threshold = 0.0

    score_parent = (G_total ** 2) / (H_total + lambda_)

    for feature_idx in range(n_features):
        feature_candidates = candidates[feature_idx]
        feature_values = X[:, feature_idx]

        for threshold in feature_candidates:
            # Split samples
            left_mask = feature_values <= threshold
            G_left = np.sum(g[left_mask])
            H_left = np.sum(h[left_mask])
            G_right = G_total - G_left
            H_right = H_total - H_left

            if H_left < 1.0 or H_right < 1.0:
                continue

            # Compute gain
            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_parent) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain

# Demonstration: Compare exact vs approximate
np.random.seed(42)
n_samples = 10000
n_features = 10
X = np.random.randn(n_samples, n_features)
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + np.random.randn(n_samples) * 0.1

g = -y  # Simplified gradients
h = np.ones(n_samples)

print("Approximate vs Exact Split Finding")
print("=" * 60)
print(f"Dataset: {n_samples} samples, {n_features} features")
print()

# Exact (for comparison)
import time
start = time.time()
exact_feat, exact_thresh, exact_gain = exact_greedy_split(X, g, h)
exact_time = time.time() - start
print(f"Exact Greedy:")
print(f"  Best feature: {exact_feat}, threshold: {exact_thresh:.4f}")
print(f"  Gain: {exact_gain:.4f}")
print(f"  Time: {exact_time:.4f}s")
print()

# Approximate with different numbers of bins
for n_bins in [16, 64, 256]:
    candidates = compute_quantile_candidates(X, n_bins)

    start = time.time()
    approx_feat, approx_thresh, approx_gain = approximate_split(X, g, h, candidates)
    approx_time = time.time() - start

    total_candidates = sum(len(c) for c in candidates)
    gain_ratio = approx_gain / exact_gain if exact_gain != 0 else 1.0

    print(f"Approximate ({n_bins} bins, {total_candidates} total candidates):")
    print(f"  Best feature: {approx_feat}, threshold: {approx_thresh:.4f}")
    print(f"  Gain: {approx_gain:.4f} ({gain_ratio:.1%} of exact)")
    print(f"  Time: {approx_time:.4f}s ({exact_time/approx_time:.1f}x speedup)")
    print()
```

Standard quantile methods treat all samples equally. But XGBoost's second-order approximation introduces a key insight: samples should be weighted by their Hessians.
Why Hessian Weighting?
Recall the approximated objective: $$\tilde{\mathcal{L}} = \sum_i \left[ g_i f(x_i) + \frac{1}{2} h_i f(x_i)^2 \right]$$
This can be rewritten as: $$\tilde{\mathcal{L}} = \sum_i \frac{1}{2} h_i \left( f(x_i) - \left(-\frac{g_i}{h_i}\right) \right)^2 + \text{constant}$$
This is a weighted least squares problem where:
- each sample's regression target is $-g_i / h_i$, and
- each sample's weight is its Hessian $h_i$.
Therefore, when computing quantiles to find good split candidates, we should use Hessian-weighted quantiles!
Weighted Quantile Definition
For a feature with values $(x_1, ..., x_n)$ and corresponding weights $(h_1, ..., h_n)$, the $\phi$-quantile is the value $z$ such that:
$$\frac{\sum_{i: x_i < z} h_i}{\sum_{i=1}^n h_i} < \phi \leq \frac{\sum_{i: x_i \leq z} h_i}{\sum_{i=1}^n h_i}$$
In other words, $\phi$ fraction of the weighted mass falls below $z$.
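As a sanity check of this definition, here is a direct (naive) implementation that returns the smallest $z$ whose cumulative weighted mass reaches $\phi$. The function name and toy data are ours for illustration, not XGBoost's:

```python
import numpy as np

def weighted_phi_quantile(x: np.ndarray, h: np.ndarray, phi: float) -> float:
    """Smallest z such that the h-weighted mass at or below z reaches phi."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    cum = np.cumsum(h_sorted) / h_sorted.sum()   # normalized weighted CDF
    return x_sorted[np.searchsorted(cum, phi)]

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.1, 0.1, 0.1, 0.7])  # most of the weighted mass sits on x = 4

# The weighted median is 4.0, even though the unweighted median lies between 2 and 3
print(weighted_phi_quantile(x, h, 0.5))  # 4.0
```

This is O(n log n) and stores everything, which is exactly what the sketch algorithms below avoid.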
The Quantile Sketch Problem
For streaming or distributed data, we cannot store all values. XGBoost uses quantile sketching algorithms that:
- maintain a compact summary of the (weighted) value distribution in bounded memory,
- answer quantile queries within a provable approximation error $\varepsilon$, and
- support merging summaries computed on different data partitions.
The XGBoost paper introduces a weighted quantile sketch for this purpose: a data structure in the spirit of the GK (Greenwald–Khanna) sketch, extended with provable guarantees for Hessian-weighted samples.
For classification with logistic loss, samples near the decision boundary (p ≈ 0.5) have higher Hessians h = p(1-p). Weighted quantiles ensure more candidate split points in these uncertain regions, where fine-grained decisions matter most. Samples with confident predictions (low h) get fewer candidates, which is appropriate since their contribution to the objective is smaller.
```python
import numpy as np

def weighted_quantile(values: np.ndarray,
                      weights: np.ndarray,
                      percentiles: np.ndarray) -> np.ndarray:
    """
    Compute weighted quantiles.

    Parameters
    ----------
    values : array of shape (n,)
        Feature values
    weights : array of shape (n,)
        Weights (Hessians in XGBoost context)
    percentiles : array
        Percentiles to compute (0-100 scale)

    Returns
    -------
    quantiles : array
        Weighted quantile values
    """
    # Sort by values
    sorted_idx = np.argsort(values)
    sorted_values = values[sorted_idx]
    sorted_weights = weights[sorted_idx]

    # Cumulative weight sum
    cumsum = np.cumsum(sorted_weights)
    total = cumsum[-1]

    # Normalize to 0-100 scale
    cumsum_pct = (cumsum / total) * 100

    # Find quantiles
    quantiles = np.interp(percentiles, cumsum_pct, sorted_values)
    return quantiles

def compare_weighted_unweighted_quantiles():
    """
    Demonstrate the difference between weighted and unweighted quantiles.
    """
    np.random.seed(42)

    # Simulate binary classification scenario
    n = 1000

    # Feature values (predictive of class)
    class_0 = np.random.randn(n // 2) - 1  # Class 0 samples
    class_1 = np.random.randn(n // 2) + 1  # Class 1 samples
    X = np.concatenate([class_0, class_1])

    # Simulate predictions from previous round
    # Samples far from boundary have extreme predictions
    logits = X * 0.5  # Correlated with feature
    probs = 1 / (1 + np.exp(-logits))

    # Hessians for logistic loss: h = p(1-p)
    h = probs * (1 - probs)

    print("Weighted vs Unweighted Quantile Comparison")
    print("=" * 60)
    print(f"Scenario: Binary classification with {n} samples")
    print()

    # Compute quantile candidates
    percentiles = np.array([10, 25, 50, 75, 90])

    # Unweighted (standard) quantiles
    unweighted = np.percentile(X, percentiles)

    # Weighted quantiles (by Hessian)
    weighted = weighted_quantile(X, h, percentiles)

    print("Quantile candidates (split thresholds):")
    print("-" * 60)
    print(f"{'Percentile':>12} {'Unweighted':>15} {'Weighted (by h)':>18}")
    print("-" * 60)
    for p, uw, w in zip(percentiles, unweighted, weighted):
        print(f"{p:>12}% {uw:>15.4f} {w:>18.4f}")

    print()
    print("Analysis:")
    print("-" * 60)

    # Show Hessian distribution
    boundary_mask = np.abs(X) < 1  # Near decision boundary
    h_boundary = np.mean(h[boundary_mask])
    h_extreme = np.mean(h[~boundary_mask])

    print(f"Mean Hessian near boundary (|X| < 1): {h_boundary:.4f}")
    print(f"Mean Hessian away from boundary:      {h_extreme:.4f}")
    print(f"Ratio: {h_boundary/h_extreme:.2f}x")
    print()
    print("Weighted quantiles shift toward the boundary region where")
    print("samples have higher Hessians (more uncertain predictions).")
    print("This places more candidate split points where they matter most!")

compare_weighted_unweighted_quantiles()
```

The state-of-the-art in XGBoost split finding is histogram-based methods. This approach, enabled by the tree_method='hist' parameter (now the default), provides dramatic speedups.
Core Idea
Instead of considering continuous feature values, we:
1. Discretize each feature into a fixed number of bins (typically 256) once, before training.
2. Accumulate gradient and Hessian sums per bin rather than per sample.
3. Scan bin boundaries, instead of raw sample values, as candidate split points.
The Histogram Trick
For each feature and each leaf node, we maintain a histogram:
histogram[bin_id] = (sum_of_g_in_bin, sum_of_h_in_bin)
Building the histogram takes $O(n)$ (one pass through samples). Finding the best split takes $O(\text{bins})$ (scan the histogram).
Since $\text{bins} \ll n$ (typically 256 vs millions), this is a massive speedup!
Histogram Subtraction Trick
An additional optimization: for binary trees, we only need to build the histogram for one child explicitly. The other child's histogram is:
$$\text{histogram}_{\text{sibling}} = \text{histogram}_{\text{parent}} - \text{histogram}_{\text{child}}$$
This halves the histogram construction time.
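A minimal sketch of the subtraction trick (illustrative only, not XGBoost's internal layout): build the parent histogram and one child's histogram explicitly, then derive the sibling's by elementwise subtraction.

```python
import numpy as np

n_bins = 8
rng = np.random.default_rng(1)
bins = rng.integers(0, n_bins, size=1000)  # pre-binned feature values for one node
g = rng.normal(size=1000)                  # per-sample gradients

# Parent histogram, built over all samples in the node
G_parent = np.zeros(n_bins)
np.add.at(G_parent, bins, g)

# Build only the (say) left child's histogram explicitly...
left = bins < 3  # stand-in for whatever mask the chosen split produced
G_left = np.zeros(n_bins)
np.add.at(G_left, bins[left], g[left])

# ...and obtain the right child's histogram for free, without touching the data
G_right = G_parent - G_left
```

The same identity holds for the Hessian histogram, so each level of the tree needs explicit histogram construction for only one of each pair of siblings.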
Memory Efficiency
Histograms use fixed memory regardless of dataset size:
- storage per node is (number of features) × (number of bins) × two accumulators (the G and H sums),
- this does not grow with $n$, so a node's statistics fit comfortably in CPU cache.
This contrasts with exact methods that need sample-level storage.
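A quick back-of-envelope calculation, assuming 256 bins and one 8-byte double each for the gradient and Hessian sums:

```python
n_features = 500
n_bins = 256
bytes_per_bin = 2 * 8  # one double each for sum-of-g and sum-of-h

# Per-node histogram memory is independent of the number of samples:
# the same ~2 MiB holds whether n is ten thousand or a billion.
per_node = n_features * n_bins * bytes_per_bin
print(f"Histogram memory per node: {per_node / 1024:.0f} KiB")  # prints 2000 KiB
```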
| Operation | Exact Greedy | Histogram-Based |
|---|---|---|
| Build structure | O(d·n·log n) sort | O(n) bin mapping |
| Per-node split | O(d·n) scan | O(d·bins) scan |
| Memory per node | O(n) samples | O(d·bins) (fixed) |
| Cache efficiency | Poor (random access) | Excellent |
| Parallelization | Limited | Highly parallel |
```python
import numpy as np
from typing import Tuple, List

class HistogramSplitFinder:
    """
    Histogram-based split finding for gradient boosting.
    """

    def __init__(self, n_bins: int = 256):
        self.n_bins = n_bins
        self.bin_edges = None  # Per-feature bin edges

    def fit_bin_edges(self, X: np.ndarray):
        """
        Compute bin edges for each feature (done once before training).
        """
        n_features = X.shape[1]
        self.bin_edges = []

        for j in range(n_features):
            # Compute percentile-based bin edges
            edges = np.percentile(X[:, j], np.linspace(0, 100, self.n_bins + 1))
            # Ensure uniqueness
            edges = np.unique(edges)
            self.bin_edges.append(edges)

    def bin_data(self, X: np.ndarray) -> np.ndarray:
        """
        Convert continuous features to bin indices.
        """
        n_samples, n_features = X.shape
        binned = np.zeros((n_samples, n_features), dtype=np.uint8)

        for j in range(n_features):
            binned[:, j] = np.digitize(X[:, j], self.bin_edges[j][1:-1])

        return binned

    def build_histogram(self, binned_data: np.ndarray, g: np.ndarray,
                        h: np.ndarray, feature_idx: int) -> Tuple[np.ndarray, np.ndarray]:
        """
        Build histogram for a single feature.
        Returns (G_per_bin, H_per_bin).
        """
        n_bins = len(self.bin_edges[feature_idx]) - 1
        G_hist = np.zeros(n_bins)
        H_hist = np.zeros(n_bins)

        # Aggregate gradients into bins
        bins = binned_data[:, feature_idx]
        np.add.at(G_hist, bins.astype(int), g)
        np.add.at(H_hist, bins.astype(int), h)

        return G_hist, H_hist

    def find_best_split_from_histogram(
        self,
        G_hist: np.ndarray,
        H_hist: np.ndarray,
        feature_idx: int,
        lambda_: float = 1.0,
        gamma: float = 0.0
    ) -> Tuple[float, float]:
        """
        Find best split by scanning histogram.
        """
        G_total = np.sum(G_hist)
        H_total = np.sum(H_hist)
        score_parent = (G_total ** 2) / (H_total + lambda_)

        best_gain = -np.inf
        best_threshold = None

        G_left = 0.0
        H_left = 0.0

        bin_edges = self.bin_edges[feature_idx]

        for bin_idx in range(len(G_hist) - 1):
            G_left += G_hist[bin_idx]
            H_left += H_hist[bin_idx]
            G_right = G_total - G_left
            H_right = H_total - H_left

            if H_left < 1.0 or H_right < 1.0:
                continue

            score_left = (G_left ** 2) / (H_left + lambda_)
            score_right = (G_right ** 2) / (H_right + lambda_)
            gain = 0.5 * (score_left + score_right - score_parent) - gamma

            if gain > best_gain:
                best_gain = gain
                # Threshold is the right edge of current bin
                best_threshold = bin_edges[bin_idx + 1]

        return best_threshold, best_gain

    def find_best_split(
        self,
        X_binned: np.ndarray,
        g: np.ndarray,
        h: np.ndarray,
        lambda_: float = 1.0,
        gamma: float = 0.0
    ) -> Tuple[int, float, float]:
        """
        Find the best split across all features using histograms.
        """
        n_features = X_binned.shape[1]

        best_gain = -np.inf
        best_feature = -1
        best_threshold = None

        for j in range(n_features):
            # Build histogram for this feature
            G_hist, H_hist = self.build_histogram(X_binned, g, h, j)

            # Find best split from histogram
            threshold, gain = self.find_best_split_from_histogram(
                G_hist, H_hist, j, lambda_, gamma
            )

            if gain > best_gain:
                best_gain = gain
                best_feature = j
                best_threshold = threshold

        return best_feature, best_threshold, best_gain

# Demonstration
np.random.seed(42)
n_samples = 100000  # Large dataset
n_features = 50

X = np.random.randn(n_samples, n_features)
y = np.sin(X[:, 0] * 2) + 0.5 * X[:, 1] + np.random.randn(n_samples) * 0.1
g = -y
h = np.ones(n_samples)

print("Histogram-Based Split Finding")
print("=" * 60)
print(f"Dataset: {n_samples:,} samples, {n_features} features")
print()

# Initialize and fit histogram finder
hist_finder = HistogramSplitFinder(n_bins=256)
hist_finder.fit_bin_edges(X)

import time

# Bin the data (done once)
start = time.time()
X_binned = hist_finder.bin_data(X)
bin_time = time.time() - start
print(f"Binning time: {bin_time:.4f}s (one-time preprocessing)")

# Find split using histograms
start = time.time()
feature, threshold, gain = hist_finder.find_best_split(X_binned, g, h)
split_time = time.time() - start

print(f"Split finding time: {split_time:.4f}s")
print(f"Best split: feature {feature}, threshold {threshold:.4f}, gain {gain:.4f}")
print()

# Compare with exact (on smaller subset for fairness)
print("Comparison with Exact Greedy (on 10K sample subset):")
X_small = X[:10000]
g_small = g[:10000]
h_small = h[:10000]

start = time.time()
exact_feat, exact_thresh, exact_gain = exact_greedy_split(X_small, g_small, h_small)
exact_time = time.time() - start
print(f"  Exact on 10K samples:      {exact_time:.4f}s")
print(f"  Histogram on 100K samples: {split_time:.4f}s")
print(f"  Histogram processes 10x more data in similar time!")
```

XGBoost exposes the split-finding algorithm through the tree_method parameter. Understanding this parameter helps you choose the right algorithm for your use case.
Available Options
| Value | Algorithm | Memory | Speed | Best For |
|---|---|---|---|---|
| exact | Exact greedy | High (in-memory) | Slow | Small datasets (<10K rows) |
| approx | Approximate (quantile) | Medium | Medium | Medium datasets |
| hist | Histogram-based | Low | Fast | Large datasets (default) |
| gpu_hist | Histogram on GPU | GPU memory | Very fast | GPU available, large data |
Recommendation
For most use cases, tree_method='hist' (now the default) is the best choice. It provides:
- training that is typically 5-20x faster than exact,
- low, fixed memory use per node (histograms rather than per-sample storage),
- split quality nearly indistinguishable from exact on most datasets.

Use tree_method='exact' only when:
- the dataset is small (roughly under 10K rows), or
- you need exact-split results for debugging or as a reference baseline.

Use tree_method='gpu_hist' when:
- a CUDA-capable GPU is available, and
- the dataset is large enough to amortize the transfer to GPU memory. (In XGBoost 2.0 and later, gpu_hist is deprecated; use tree_method='hist' together with device='cuda' instead.)
```python
import xgboost as xgb
import numpy as np
import time

# Generate sample data
np.random.seed(42)
n_samples = 50000
n_features = 100

X = np.random.randn(n_samples, n_features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Binary classification

dtrain = xgb.DMatrix(X, label=y)

# Common parameters
base_params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'learning_rate': 0.1,
    'nthread': -1,   # native API uses 'nthread'; 'n_jobs' belongs to the sklearn wrapper
    'verbosity': 0
}

print("XGBoost tree_method Comparison")
print("=" * 60)
print(f"Dataset: {n_samples:,} samples, {n_features} features")
print()

# Compare different tree methods
for tree_method in ['exact', 'approx', 'hist']:
    params = {**base_params, 'tree_method': tree_method}

    start = time.time()
    model = xgb.train(params, dtrain, num_boost_round=100)
    elapsed = time.time() - start

    print(f"tree_method='{tree_method}':")
    print(f"  Training time: {elapsed:.2f}s")
    print()

# The 'hist' method is usually 5-20x faster than 'exact'
# with virtually identical prediction quality

print("Practical Recommendations:")
print("-" * 60)
print("1. Use 'hist' (default) for most cases")
print("2. Use 'gpu_hist' if you have a GPU and large data")
print("3. Use 'exact' only for small data or debugging")
print("4. Use 'approx' rarely - 'hist' is usually better")
```

When using tree_method='hist', the max_bin parameter controls the number of histogram bins (default 256). Increasing it (e.g., 512) can slightly improve split quality at the cost of more memory and computation. Decreasing it (e.g., 64) saves memory for very large datasets. For most cases, 256 is a good balance.
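The bins-versus-resolution tradeoff behind max_bin can be illustrated without running XGBoost at all: with quantile binning, the best representable threshold can miss the ideal cut by up to roughly half a bin width. (A toy sketch; `true_cut` is an arbitrary illustrative value, not derived from any dataset.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)  # one feature column
true_cut = 0.337                # hypothetical "ideal" split point

for n_bins in (16, 64, 256):
    # Quantile bin edges, as a hist-style binning would compute them
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    # The closest threshold the binned representation can express
    err = np.abs(edges - true_cut).min()
    print(f"max_bin={n_bins:>3}: nearest representable threshold off by {err:.5f}")
```

The error shrinks roughly in proportion to 1/max_bin, which is why the default of 256 already leaves very little split quality on the table for most datasets.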
Beyond algorithmic complexity, XGBoost optimizes for cache efficiency—a crucial factor for modern CPU performance.
The Memory Hierarchy Problem
Modern CPUs have a memory hierarchy (sizes and latencies are rough, typical figures):
- L1 cache: ~32 KB, ~1 ns access
- L2 cache: ~256 KB-1 MB, ~4 ns
- L3 cache: ~8-64 MB, ~10-20 ns
- Main memory (RAM): gigabytes, ~100 ns
Accessing data not in cache triggers expensive memory fetches. For large datasets, random memory access patterns can slow algorithms by 10-100x.
XGBoost's Cache Optimizations
Block-Based Data Storage: data is held in in-memory blocks in compressed column (CSC) format, with each column pre-sorted by feature value. The sort is paid for once, before training begins.
Sorted Index Reuse: because the column blocks stay sorted, split finding at every node reuses the same sorted order with a linear scan; no per-node re-sorting is required.
Gradient Statistics Buffering: scanning a sorted column reads gradient statistics in a data-dependent, effectively random order. XGBoost's cache-aware variant prefetches these statistics into small thread-local buffers and accumulates them in mini-batches, cutting cache misses.
Histogram Method's Cache Advantage
The histogram method is particularly cache-friendly:
- binned data is stored as compact integer codes (e.g., uint8 for up to 256 bins), shrinking the working set,
- samples are read sequentially, a pattern the hardware prefetcher handles well,
- the histogram being updated is small enough to stay resident in L1/L2 cache.
This is why tree_method='hist' often provides speedups beyond what simple complexity analysis predicts.
tree_method='hist' has the best cache behavior

Many ML researchers focus only on algorithmic complexity (Big-O). But in practice, cache behavior can dominate runtime for large datasets: an O(n) algorithm with poor cache behavior can be slower than an O(n log n) algorithm with good cache behavior. XGBoost's attention to these details is part of why it is so fast.
We have explored the algorithms that make XGBoost's split finding efficient and scalable.
What's Next
With split finding covered, we'll explore XGBoost's sparsity-aware algorithm—how it efficiently handles missing values and sparse data. This is crucial for real-world datasets where missing values are common and categorical variables create sparse one-hot encodings.
You now understand XGBoost's efficient split finding algorithms—from exact greedy to histogram-based methods. These algorithms enable XGBoost to scale to datasets with millions of samples while maintaining high split quality. The histogram method, in particular, is a key innovation that makes XGBoost practical for large-scale machine learning.