So far, we've primarily discussed splitting criteria in the context of categorical features: 'Is Outlook sunny, overcast, or rainy?' But real-world data is dominated by continuous features: age, income, temperature, transaction amount, sensor readings.
Continuous features present a fundamental challenge: infinitely many possible split points. For a feature with values $\{3.2, 1.7, 5.1, 4.8, \ldots\}$, we could split at 2.0, 2.5, 3.0, or any real number in between. How do we efficiently find the best threshold?
This page reveals that despite the apparent infinity of choices, we only need to examine finitely many candidate thresholds—and we can do it efficiently with clever algorithms.
By the end of this page, you will understand how continuous features are discretized into binary splits, master the efficient threshold search algorithm, handle special cases like ties and missing values, and appreciate the computational complexity of tree construction.
For continuous features, decision trees use binary thresholding:
$$\text{Split: } X_j \leq t \text{ vs } X_j > t$$
where $X_j$ is the $j$-th feature and $t$ is the threshold.
Why Binary Splits?
Contrast with Categorical:
For categorical features with $V$ values, we could create $V$ branches. For continuous features, creating branches for every distinct value would cause extreme overfitting. Binary splits provide regularization.
The Key Insight:
We don't need to test every real number. Only thresholds between consecutive distinct values (after sorting) can produce different partitions.
For $n$ samples there are at most $n-1$ boundaries between consecutive distinct sorted values. If two consecutive distinct values are 3.2 and 3.5, then any threshold strictly between them (3.3, 3.35, or anything else) produces exactly the same partition, so one candidate per boundary suffices. This reduces infinitely many choices to $O(n)$ candidates.
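To make the reduction concrete, here is a minimal sketch (the helper name `candidate_thresholds` is illustrative, not from any library) that enumerates the midpoints between consecutive distinct sorted values:

```python
import numpy as np

def candidate_thresholds(x: np.ndarray) -> np.ndarray:
    """Midpoints between consecutive distinct sorted values."""
    distinct = np.unique(x)                    # sorted distinct values
    return (distinct[:-1] + distinct[1:]) / 2

x = np.array([3.2, 1.7, 5.1, 4.8, 3.2])
print(candidate_thresholds(x))                 # [2.45, 4.0, 4.95]: only 3 candidates for 5 samples
```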
Here's the efficient algorithm for finding the optimal threshold:
Algorithm: Best Threshold for Continuous Feature
Input: Feature values X_j, labels y, impurity function I
Output: Best threshold t*, impurity reduction ΔI*
1. Sort samples by feature value: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)
where x₁ ≤ x₂ ≤ ... ≤ xₙ
2. Initialize:
- left_counts = [0, 0, ..., 0] (K classes)
- right_counts = counts(y) (all samples start on right)
3. For i = 1 to n-1:
a. Move sample i from right to left:
left_counts[yᵢ] += 1
right_counts[yᵢ] -= 1
b. If xᵢ < xᵢ₊₁ (valid threshold position):
- Compute weighted impurity of children
- Update best if improvement found
- Candidate threshold: t = (xᵢ + xᵢ₊₁) / 2
4. Return t*, ΔI*
Complexity: $O(n \log n)$ for sorting plus $O(nK)$ for the scan, which is $O(n \log n)$ overall when the number of classes $K$ is small relative to $n$.
```python
import numpy as np
from typing import Tuple, Callable

def gini_impurity(counts: np.ndarray) -> float:
    """Compute Gini impurity from class counts."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - np.sum(p ** 2)

def find_best_threshold(X_j: np.ndarray,
                        y: np.ndarray,
                        n_classes: int,
                        impurity_fn: Callable = gini_impurity
                        ) -> Tuple[float, float, float]:
    """
    Find the optimal threshold for a continuous feature.

    This is the core algorithm used in CART, Random Forests, and
    gradient boosting for continuous feature splitting.

    Args:
        X_j: Feature values for one feature (n_samples,)
        y: Class labels (n_samples,) with values in 0..n_classes-1
        n_classes: Number of classes
        impurity_fn: Impurity function (default: Gini)

    Returns:
        (best_threshold, best_gain, best_weighted_impurity)

    Complexity: O(n log n) for sorting + O(n * K) for scanning
                = O(n log n) when K << n
    """
    n = len(y)

    # Step 1: Sort by feature value
    sorted_indices = np.argsort(X_j)
    sorted_features = X_j[sorted_indices]
    sorted_labels = y[sorted_indices]

    # Compute parent impurity
    parent_counts = np.bincount(y, minlength=n_classes)
    parent_impurity = impurity_fn(parent_counts)

    # Step 2: Initialize counts
    left_counts = np.zeros(n_classes, dtype=np.int64)
    right_counts = parent_counts.copy()

    best_gain = -np.inf
    best_threshold = None
    best_weighted_impurity = None

    left_size = 0
    right_size = n

    # Step 3: Scan through all split positions
    for i in range(n - 1):
        # Move sample i from right to left
        c = sorted_labels[i]
        left_counts[c] += 1
        right_counts[c] -= 1
        left_size += 1
        right_size -= 1

        # Skip if same value (not a valid split point)
        if sorted_features[i] >= sorted_features[i + 1]:
            continue

        # Compute weighted impurity
        left_impurity = impurity_fn(left_counts)
        right_impurity = impurity_fn(right_counts)
        weighted_impurity = (
            (left_size / n) * left_impurity +
            (right_size / n) * right_impurity
        )

        gain = parent_impurity - weighted_impurity

        if gain > best_gain:
            best_gain = gain
            # Threshold midpoint between consecutive values
            best_threshold = (sorted_features[i] + sorted_features[i + 1]) / 2
            best_weighted_impurity = weighted_impurity

    return best_threshold, best_gain, best_weighted_impurity

def demonstration():
    """Demonstrate threshold finding on example data."""
    np.random.seed(42)

    print("Continuous Feature Threshold Search")
    print("=" * 60)

    # Create separable data
    n = 100
    X = np.concatenate([
        np.random.normal(2.0, 0.5, n//2),   # Class 0 centered at 2
        np.random.normal(5.0, 0.5, n//2)    # Class 1 centered at 5
    ])
    y = np.array([0]*(n//2) + [1]*(n//2))

    print(f"Data: {n} samples, 2 classes")
    print(f"Class 0 centered at x=2.0")
    print(f"Class 1 centered at x=5.0")
    print(f"Parent Gini: {gini_impurity(np.bincount(y, minlength=2)):.4f}")

    threshold, gain, weighted_imp = find_best_threshold(X, y, n_classes=2)

    print(f"\nOptimal split:")
    print(f"  Threshold: x ≤ {threshold:.3f}")
    print(f"  Gini Gain: {gain:.4f}")
    print(f"  Weighted Impurity: {weighted_imp:.4f}")

    # Verify split
    left_samples = X <= threshold
    right_samples = X > threshold
    print(f"\nSplit result:")
    print(f"  Left (x ≤ {threshold:.2f}): {left_samples.sum()} samples")
    print(f"    Class 0: {(y[left_samples] == 0).sum()}")
    print(f"    Class 1: {(y[left_samples] == 1).sum()}")
    print(f"  Right (x > {threshold:.2f}): {right_samples.sum()} samples")
    print(f"    Class 0: {(y[right_samples] == 0).sum()}")
    print(f"    Class 1: {(y[right_samples] == 1).sum()}")

    # More complex example
    print(f"\n" + "=" * 60)
    print("Multi-modal Data (3 classes)")
    print("=" * 60)

    X_complex = np.concatenate([
        np.random.normal(1.0, 0.3, 30),   # Class 0
        np.random.normal(3.0, 0.3, 40),   # Class 1
        np.random.normal(5.0, 0.3, 30)    # Class 2
    ])
    y_complex = np.array([0]*30 + [1]*40 + [2]*30)

    threshold, gain, _ = find_best_threshold(X_complex, y_complex, n_classes=3)
    print(f"Best first split: x ≤ {threshold:.3f}, Gain = {gain:.4f}")

if __name__ == "__main__":
    demonstration()
```

Tree construction speed is critical: modern ensembles build thousands of trees. Here are key optimizations:
1. Pre-sorting
Sort each feature once at the root, then propagate sorted indices to children. Reduces $O(n \log n)$ sorting at each node to a single pass.
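A minimal sketch of the bookkeeping, under the assumption of a simple index-array representation (the names here are illustrative, not scikit-learn's actual internals): sort once at the root, then filter the sorted index arrays with the split mask so each child inherits an already-sorted view.

```python
import numpy as np

# Sort each feature once at the root: one column of sorted sample indices per feature.
rng = np.random.default_rng(0)
X = rng.random((10, 3))                        # 10 samples, 3 features
sorted_idx = np.argsort(X, axis=0)             # shape (n_samples, n_features)

# Suppose the root splits on feature 0 at threshold t.
t = 0.5
goes_left = X[:, 0] <= t                       # boolean mask over samples

# Propagate the sorted order to the children without re-sorting:
# filtering a sorted index array preserves its order.
left_sorted_idx = [col[goes_left[col]] for col in sorted_idx.T]
right_sorted_idx = [col[~goes_left[col]] for col in sorted_idx.T]

# Each child now has, for every feature, its samples already in sorted order.
print(len(left_sorted_idx[0]), len(right_sorted_idx[0]))
```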
2. Histogram-Based Splitting
Instead of evaluating exact sample boundaries, discretize each feature into a fixed number of bins (commonly 256) and evaluate only the bin boundaries; a full implementation follows below.
3. Sample Subsampling
For large datasets, evaluate thresholds on a random subset:
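A hedged sketch of this idea, assuming `find_best_threshold` from the first listing on this page is in scope; the subsample size and the helper name `subsampled_threshold` are arbitrary choices for illustration.

```python
import numpy as np

def subsampled_threshold(X_j, y, n_classes, subsample=10_000, rng=None):
    """Approximate the threshold search by scanning only a random subset."""
    rng = np.random.default_rng(rng)
    n = len(y)
    if n <= subsample:
        return find_best_threshold(X_j, y, n_classes)        # from the first listing
    idx = rng.choice(n, size=subsample, replace=False)
    return find_best_threshold(X_j[idx], y[idx], n_classes)
```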
4. Early Stopping
Stop searching if no split can beat current best:
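One simple form of such a bound, sketched below under the assumption that `gini_impurity` and `find_best_threshold` from the first listing are in scope: the gain at a node can never exceed the parent impurity, so once a (near-)pure split is found, scanning the remaining features cannot help.

```python
import numpy as np

def best_split_with_early_stop(X, y, n_classes):
    """Scan features for the best split, stopping once no split could do better."""
    parent_counts = np.bincount(y, minlength=n_classes)
    parent_impurity = gini_impurity(parent_counts)               # from the first listing
    best = (-np.inf, None, None)                                 # (gain, feature, threshold)
    for j in range(X.shape[1]):
        t, gain, _ = find_best_threshold(X[:, j], y, n_classes)  # from the first listing
        if gain > best[0]:
            best = (gain, j, t)
        # No split can gain more than the parent impurity itself, so once a
        # (near-)pure split is found, the remaining features cannot improve on it.
        if best[0] >= parent_impurity - 1e-12:
            break
    return best
```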
```python
import numpy as np
from typing import Tuple

def histogram_based_split(X_j: np.ndarray,
                          y: np.ndarray,
                          n_classes: int,
                          n_bins: int = 256
                          ) -> Tuple[float, float]:
    """
    Histogram-based threshold finding (LightGBM/CatBoost style).

    Instead of evaluating every sample boundary, discretize into bins
    and evaluate bin boundaries. Much faster for large n.

    Complexity: O(n) for binning + O(bins * K) for search
                = O(n) when bins is constant (e.g., 256)
    """
    n = len(y)

    # Create histogram bins
    # Use percentile-based binning for better distribution
    percentiles = np.linspace(0, 100, n_bins + 1)
    bin_edges = np.percentile(X_j, percentiles)
    bin_edges = np.unique(bin_edges)  # Remove duplicates
    actual_bins = len(bin_edges) - 1

    # Assign samples to bins (vectorized)
    bin_indices = np.digitize(X_j, bin_edges[1:-1])

    # Build histogram: counts per bin per class
    # histogram[b, c] = count of class c in bin b
    histogram = np.zeros((actual_bins, n_classes), dtype=np.int64)
    for b in range(actual_bins):
        mask = (bin_indices == b)
        if mask.any():
            histogram[b] = np.bincount(y[mask], minlength=n_classes)

    # Compute cumulative sums for efficient scan
    cumsum = np.cumsum(histogram, axis=0)
    total = cumsum[-1]

    def gini_from_counts(counts):
        n = counts.sum()
        if n == 0:
            return 0.0
        p = counts / n
        return 1.0 - np.sum(p ** 2)

    parent_gini = gini_from_counts(total)

    best_gain = -np.inf
    best_bin = None

    # Scan bin boundaries
    for b in range(actual_bins - 1):
        left_counts = cumsum[b]
        right_counts = total - left_counts

        left_n = left_counts.sum()
        right_n = right_counts.sum()

        if left_n == 0 or right_n == 0:
            continue

        left_gini = gini_from_counts(left_counts)
        right_gini = gini_from_counts(right_counts)
        weighted_gini = (left_n/n) * left_gini + (right_n/n) * right_gini

        gain = parent_gini - weighted_gini

        if gain > best_gain:
            best_gain = gain
            best_bin = b

    if best_bin is None:
        return None, 0.0

    # Threshold is upper edge of best bin
    best_threshold = bin_edges[best_bin + 1]
    return best_threshold, best_gain

def benchmark_comparison():
    """Compare exact vs histogram-based splitting speed."""
    import time

    print("Speed Comparison: Exact vs Histogram-Based")
    print("=" * 60)

    sizes = [1000, 10000, 100000, 1000000]

    for n in sizes:
        X = np.random.randn(n)
        y = (X > 0).astype(int)

        # Time histogram-based
        start = time.perf_counter()
        for _ in range(3):
            histogram_based_split(X, y, n_classes=2, n_bins=256)
        hist_time = (time.perf_counter() - start) / 3

        print(f"n={n:>8,}: Histogram = {hist_time*1000:>8.2f} ms")

if __name__ == "__main__":
    # Example usage
    np.random.seed(42)
    X = np.random.randn(1000)
    y = (X > 0.5).astype(int)
    thresh, gain = histogram_based_split(X, y, n_classes=2)
    print(f"Histogram split: threshold={thresh:.3f}, gain={gain:.4f}")
    print()
    benchmark_comparison()
```

| Algorithm | Time per Feature | Used In |
|---|---|---|
| Exact (naive) | $O(n^2 \log n)$ | Theoretical only |
| Exact (presort) | $O(n \log n)$ | scikit-learn |
| Histogram (256 bins) | $O(n)$ | LightGBM, CatBoost |
| Histogram + sampling | $O(\text{subsample size})$ | XGBoost |
When multiple samples have the same feature value, we face the tie-breaking problem. The split $x \leq t$ must place all tied samples on the same side.
The Challenge:
If 100 samples have $x = 3.5$, we cannot split them. Any threshold $t < 3.5$ puts all 100 on the right; $t \geq 3.5$ puts all 100 on the left.
Implications: tied samples always move as a block, so heavy ties shrink the set of candidate thresholds and can hide a class boundary that falls inside a tied group.
Strategies:
Standard approach: Only consider thresholds between distinct values (as in our algorithm)
Random tie-breaking: Add small noise to create distinctions (used in some ensembles)
Secondary features: Break ties using another feature (requires algorithm modification)
Accept limitation: In practice, continuous features rarely have many ties unless discretized or rounded
If every value is unique (like a customer ID), there are $n-1$ candidate thresholds and the tree can isolate arbitrarily small groups, so overfitting is severe. This is why continuous 'features' that are actually identifiers must be excluded or treated specially.
```python
import numpy as np

def count_unique_splits(X_j: np.ndarray) -> int:
    """
    Count the number of unique split candidates for a feature.

    This is the number of distinct boundaries between consecutive
    sorted unique values.
    """
    unique_values = np.unique(X_j)
    return len(unique_values) - 1

def analyze_tie_impact(X_j: np.ndarray, y: np.ndarray) -> dict:
    """
    Analyze how ties affect splitting for a feature.
    """
    n = len(X_j)
    unique_values = np.unique(X_j)
    n_unique = len(unique_values)
    n_splits = n_unique - 1

    # Count samples per unique value
    value_counts = {}
    for v in unique_values:
        mask = X_j == v
        value_counts[v] = {
            'total': mask.sum(),
            'class_dist': np.bincount(y[mask], minlength=2).tolist()
        }

    # Find largest tie
    max_tie = max(vc['total'] for vc in value_counts.values())

    return {
        'n_samples': n,
        'n_unique': n_unique,
        'n_candidate_splits': n_splits,
        'max_tie_size': max_tie,
        'tie_ratio': 1 - (n_unique / n),
        'candidate_ratio': n_splits / max(n - 1, 1)
    }

# Example
print("Tie Analysis Examples")
print("=" * 60)

# Continuous with no ties
X_continuous = np.random.randn(100)
y = (X_continuous > 0).astype(int)
stats = analyze_tie_impact(X_continuous, y)
print(f"\nContinuous (no ties):")
for k, v in stats.items():
    print(f"  {k}: {v}")

# Discretized (many ties)
X_discrete = np.round(np.random.randn(100), 0)  # Round to integer
stats = analyze_tie_impact(X_discrete, y)
print(f"\nDiscretized (rounded):")
for k, v in stats.items():
    print(f"  {k}: {v}")

# Extreme ties (like low-precision data)
X_extreme = np.round(np.random.uniform(0, 5, 100))
stats = analyze_tie_impact(X_extreme, y)
print(f"\nExtreme ties (integers 0-5 only):")
for k, v in stats.items():
    print(f"  {k}: {v}")
```

Real datasets often have missing values. Different tree algorithms handle this differently:
1. C4.5 Approach: Fractional Splits
Distribute samples with missing values proportionally to both children:
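A simplified sketch of the weighting scheme (the helper name `fractional_split_weights` is illustrative; real C4.5 also carries these fractional weights through its gain-ratio computations): each sample with a missing value goes to both children, weighted by the share of observed samples that went each way.

```python
import numpy as np

def fractional_split_weights(x, weights, threshold):
    """C4.5-style fractional routing of missing values for one split (simplified)."""
    missing = np.isnan(x)
    obs_left = (~missing) & (x <= threshold)
    obs_right = (~missing) & (x > threshold)
    p_left = weights[obs_left].sum() / weights[~missing].sum()

    left_w = np.zeros_like(weights)
    right_w = np.zeros_like(weights)
    left_w[obs_left] = weights[obs_left]
    right_w[obs_right] = weights[obs_right]
    left_w[missing] = weights[missing] * p_left             # missing samples go both ways,
    right_w[missing] = weights[missing] * (1 - p_left)      # weighted by observed proportions
    return left_w, right_w

x = np.array([1.0, 2.0, np.nan, 4.0])
w = np.ones_like(x)
left_w, right_w = fractional_split_weights(x, w, threshold=2.5)
print(left_w)    # the missing sample carries weight 2/3 into the left child
print(right_w)   # and weight 1/3 into the right child
```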
2. CART Approach: Surrogate Splits
Find alternative features that mimic the primary split:
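A hedged sketch of the core idea (real CART ranks several surrogates and compares them against a majority-direction baseline; `best_surrogate` is an illustrative name): search the other features for the threshold whose left/right assignment best agrees with the primary split.

```python
import numpy as np

def best_surrogate(X, primary_goes_left, exclude_feature):
    """Find the (feature, threshold) whose split best agrees with the primary split."""
    best_agreement, best_feature, best_threshold = 0.0, None, None
    for j in range(X.shape[1]):
        if j == exclude_feature:
            continue
        for t in np.unique(X[:, j])[:-1]:                   # each distinct value except the max
            goes_left = X[:, j] <= t
            # A surrogate may also mimic the primary split with directions swapped.
            agreement = max((goes_left == primary_goes_left).mean(),
                            (goes_left != primary_goes_left).mean())
            if agreement > best_agreement:
                best_agreement, best_feature, best_threshold = agreement, j, t
    return best_feature, best_threshold, best_agreement

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)              # feature 1 tracks feature 0
primary = X[:, 0] <= 0.0                                     # primary split on feature 0
print(best_surrogate(X, primary, exclude_feature=0))         # feature 1 should agree most often
```

Samples missing the primary feature are then routed by the surrogate's rule instead.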
3. XGBoost/LightGBM: Default Direction
Learn which direction missing values should take; see the full listing below.
4. Simple Imputation
Replace missing with median/mean before splitting:
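A minimal sketch of median imputation followed by the standard search, assuming `find_best_threshold` from the first listing is in scope; `impute_then_split` is an illustrative name.

```python
import numpy as np

def impute_then_split(X_j, y, n_classes):
    """Fill missing values with the feature median, then run the standard search."""
    filled = np.where(np.isnan(X_j), np.nanmedian(X_j), X_j)
    return find_best_threshold(filled, y, n_classes)         # from the first listing
```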
```python
import numpy as np
from typing import Tuple

def find_threshold_with_missing(X_j: np.ndarray,
                                y: np.ndarray,
                                n_classes: int = 2
                                ) -> Tuple[float, float, str]:
    """
    Find optimal threshold AND default direction for missing values.

    This is the XGBoost/LightGBM approach: learn which side
    missing values should go.
    """
    # Separate missing and non-missing
    missing_mask = np.isnan(X_j)
    non_missing_mask = ~missing_mask

    X_valid = X_j[non_missing_mask]
    y_valid = y[non_missing_mask]
    X_missing = X_j[missing_mask]
    y_missing = y[missing_mask]

    n_valid = len(X_valid)
    n_missing = len(X_missing)
    n_total = n_valid + n_missing

    if n_valid == 0:
        return None, 0.0, None

    # Find best threshold on non-missing
    sorted_idx = np.argsort(X_valid)
    sorted_X = X_valid[sorted_idx]
    sorted_y = y_valid[sorted_idx]

    missing_counts = np.bincount(y_missing, minlength=n_classes) if n_missing > 0 else np.zeros(n_classes)

    def gini(counts):
        n = counts.sum()
        if n == 0:
            return 0
        p = counts / n
        return 1 - np.sum(p ** 2)

    parent_counts = np.bincount(y, minlength=n_classes)
    parent_gini = gini(parent_counts)

    left_counts_valid = np.zeros(n_classes, dtype=np.int64)
    right_counts_valid = np.bincount(y_valid, minlength=n_classes)

    best_gain = -np.inf
    best_threshold = None
    best_direction = None

    left_valid_n = 0
    right_valid_n = n_valid

    for i in range(n_valid - 1):
        c = sorted_y[i]
        left_counts_valid[c] += 1
        right_counts_valid[c] -= 1
        left_valid_n += 1
        right_valid_n -= 1

        if sorted_X[i] >= sorted_X[i + 1]:
            continue

        threshold = (sorted_X[i] + sorted_X[i + 1]) / 2

        # Try missing going LEFT
        left_total = left_counts_valid + missing_counts
        right_total = right_counts_valid
        n_left = left_valid_n + n_missing
        n_right = right_valid_n
        w_gini_left = (n_left/n_total) * gini(left_total) + (n_right/n_total) * gini(right_total)
        gain_left = parent_gini - w_gini_left

        # Try missing going RIGHT
        left_total = left_counts_valid
        right_total = right_counts_valid + missing_counts
        n_left = left_valid_n
        n_right = right_valid_n + n_missing
        w_gini_right = (n_left/n_total) * gini(left_total) + (n_right/n_total) * gini(right_total)
        gain_right = parent_gini - w_gini_right

        # Choose best direction
        if gain_left > gain_right and gain_left > best_gain:
            best_gain = gain_left
            best_threshold = threshold
            best_direction = 'left'
        elif gain_right > best_gain:
            best_gain = gain_right
            best_threshold = threshold
            best_direction = 'right'

    return best_threshold, best_gain, best_direction

# Example
print("Missing Value Handling Example")
print("=" * 60)

np.random.seed(42)
n = 100
X = np.random.randn(n)
y = (X > 0).astype(int)

# Introduce 10% missing values
missing_idx = np.random.choice(n, 10, replace=False)
X_with_missing = X.copy()
X_with_missing[missing_idx] = np.nan

print(f"Total samples: {n}")
print(f"Missing: {np.isnan(X_with_missing).sum()}")

thresh, gain, direction = find_threshold_with_missing(X_with_missing, y)
print(f"\nOptimal split:")
print(f"  Threshold: {thresh:.3f}")
print(f"  Gain: {gain:.4f}")
print(f"  Missing goes: {direction}")
print(f"\nSplit rule: IF x ≤ {thresh:.3f} OR (x is missing AND direction='{direction}')")
print("            → go LEFT")
```

XGBoost's 'sparsity-aware' splitting is elegant: it treats missing values as just another category and learns the optimal direction. This often outperforms imputation because 'missingness' itself can be informative.
While binary splits are standard, some scenarios benefit from multi-way splits:
1. Discretization Approaches
Pre-bin continuous features into intervals:
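A hedged sketch of quantile-based pre-binning (`quantile_bin` is an illustrative name and the bin count is arbitrary); the resulting bin index can then be treated as an ordinal feature.

```python
import numpy as np

def quantile_bin(x: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Map a continuous feature to ordinal bin indices using quantile edges."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))[1:-1]   # interior edges only
    return np.digitize(x, edges)                                   # values in 0..n_bins-1

x = np.random.default_rng(0).normal(size=1000)
bins = quantile_bin(x, n_bins=4)
print(np.bincount(bins))   # roughly equal-sized bins
```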
2. Natural Breakpoints
Some features have meaningful thresholds, such as age 18, a freezing point of 0°C, or a regulatory cutoff; domain knowledge can supply these split points directly instead of searching for them.
3. Recursive Binary Splits
Binary splits on the same feature at different levels:
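For example, two stacked binary splits on the same feature reproduce a three-way partition; the thresholds below are arbitrary illustration values.

```python
import numpy as np

x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
# Root split:        x <= 2.0 ?   (left child is one region)
# Right child split: x <= 4.0 ?   (same feature, one level deeper)
region = np.where(x <= 2.0, "low", np.where(x <= 4.0, "mid", "high"))
print(region)   # ['low' 'low' 'mid' 'mid' 'high' 'high']
```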
Why Binary Dominates:
LightGBM, CatBoost, and XGBoost all use binary splits for continuous features. Histogram binning (256 bins) approximates multi-way splitting's benefits while maintaining binary split simplicity.
Let's analyze the total complexity of building a decision tree with continuous features.
Single Node Split: sorting one feature costs $O(n \log n)$ and the scan costs $O(nK)$, so evaluating all $d$ features at a node with $n$ samples costs $O(dn \log n)$.
Full Tree:
If tree is balanced with depth $h$: $$\text{Total work} = \sum_{\ell=0}^{h} 2^\ell \cdot d \cdot \frac{n}{2^\ell} \log \frac{n}{2^\ell}$$
This simplifies to $O(dn \log n \cdot h)$ for balanced trees.
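Working out the sum, assuming a balanced tree and re-sorting at every node:

$$\sum_{\ell=0}^{h} 2^{\ell} \cdot d \cdot \frac{n}{2^{\ell}} \log \frac{n}{2^{\ell}} \;=\; d\,n \sum_{\ell=0}^{h} \left( \log n - \ell \log 2 \right) \;\le\; d\,n\,(h+1) \log n \;=\; O(dn \log n \cdot h)$$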
With pre-sorting at the root: the sort order is computed once and reused, so each node's scan is linear in its sample count ($O(dn)$ at the root), and a balanced tree costs $O(dn \log n)$ in total.
With histogram binning: after a one-time $O(dn)$ binning pass, each split search touches only bin statistics, roughly $O(d \cdot \text{bins} \cdot K)$ per node, independent of the node's sample count.
| Method | Per Node | Full Tree (balanced) | Used By |
|---|---|---|---|
| Naive (resort each node) | $O(dn \log n)$ | $O(dn^2 \log n)$ | Textbook only |
| Pre-sorted | $O(dn)$ | $O(dn \log n)$ | scikit-learn |
| Histogram (256 bins) | $O(d \cdot 256 \cdot K)$ | $O(dn + \text{nodes} \cdot d)$ | LightGBM |
| Subsampled histogram | $O(d \cdot 256 \cdot K)$ | $O(dn)$ expected | XGBoost |
Unlike neural networks or SVMs, decision trees don't require feature scaling. The split $x \leq t$ is invariant to rescaling: if all values are multiplied by 1000, the learned threshold is also multiplied by 1000, but the resulting partition of the samples is identical.
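A quick sanity check of this claim, assuming `find_best_threshold` from the first listing is in scope:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X > 0.3).astype(int)

t1, _, _ = find_best_threshold(X, y, n_classes=2)            # from the first listing
t2, _, _ = find_best_threshold(1000 * X, y, n_classes=2)

print(t1, t2)                                      # t2 is (essentially) 1000 * t1
print(np.array_equal(X <= t1, 1000 * X <= t2))     # expect True: identical partition
```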
Module Complete!
You've now completed the Splitting Criteria module. You understand how continuous features are reduced to binary threshold splits, how the optimal threshold is found efficiently, how histogram binning and other optimizations speed up the search, how ties and missing values are handled, and how these pieces add up in the overall complexity of tree construction.
Next, you're ready for Tree Growing Algorithms (ID3, C4.5, CART) to see how these criteria integrate into complete tree construction algorithms.
Congratulations! You've achieved a deep, Principal Engineer-level understanding of decision tree splitting criteria. From mathematical foundations through computational optimizations, you can now reason about, implement, and tune decision tree construction with expert-level insight.