In the clean world of textbook machine learning, every dataset is complete—every cell filled, every feature known. Reality is messier. Real-world datasets routinely have 10-30% missing values, and some domains (healthcare, surveys, sensor data) can have 50% or more.
Missing data isn't just an inconvenience—it's information. The pattern of missingness often carries signal about the underlying process. A patient who skips a medical test might be avoiding bad news. A sensor that stops reporting might indicate equipment failure. A survey respondent who leaves income blank might have extreme (high or low) income.
Decision trees have inherent advantages for missing data. Unlike methods requiring complete feature vectors, trees can route instances through splits using only available features. But leveraging this advantage requires understanding the different approaches and their tradeoffs.
By the end of this page, you will understand: (1) Types of missing data mechanisms and their implications; (2) Three main approaches: deletion, fractional instances, and surrogates; (3) How to compute information gain with missing values; (4) Prediction strategies when test instances have missing features; (5) Best practices for handling missingness in tree-based models.
Rubin's Classification of Missing Data:
Statistician Donald Rubin established the foundational framework for understanding missing data in 1976. Understanding these mechanisms determines which approaches are valid:
1. Missing Completely At Random (MCAR): missingness is independent of both observed and unobserved values (e.g., a sensor drops readings at random).
2. Missing At Random (MAR): missingness depends only on observed variables (e.g., older respondents skip a question, and age is recorded).
3. Missing Not At Random (MNAR): missingness depends on the unobserved value itself (e.g., high earners skip the income question because of their income).
| Mechanism | Pattern | Simple Deletion | Tree Built-in Methods |
|---|---|---|---|
| MCAR | Completely random | ✓ Valid (inefficient) | ✓ Valid |
| MAR | Predictable from other features | ⚠ May be valid | ✓ Valid (surrogates leverage this) |
| MNAR | Depends on missing value | ✗ Biased | ⚠ May still be biased |
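The bias risk in the MNAR row is easy to see in a small simulation (the missingness probabilities below are made up for illustration): under MCAR the complete-case mean stays close to the truth, while under MNAR, where larger values are more likely to go missing, it is biased low.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(100, 15, 10_000)

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(true_values.size) < 0.30

# MNAR: the probability of being missing grows with the value itself
mnar_mask = rng.random(true_values.size) < np.clip((true_values - 100) / 50 + 0.30, 0, 1)

print(f"True mean:          {true_values.mean():.1f}")
print(f"Complete-case MCAR: {true_values[~mcar_mask].mean():.1f}")  # close to the true mean
print(f"Complete-case MNAR: {true_values[~mnar_mask].mean():.1f}")  # biased low
```

Deleting incomplete cases under MCAR only costs sample size; under MNAR it shifts the estimate itself.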
You can't definitively determine the mechanism from data alone (it depends on unobserved values). However, you can investigate: (1) Is missingness correlated with other variables? (2) Do complete cases differ systematically from incomplete cases? (3) Does domain knowledge suggest why values are missing?
```python
import numpy as np
import pandas as pd
from scipy import stats

def analyze_missingness(df: pd.DataFrame, target_col: str):
    """
    Analyze missing data patterns to diagnose mechanism.
    """
    print("Missing Data Analysis")
    print("=" * 60)

    # Overall missingness
    print("1. Missingness by Column:")
    print("-" * 40)
    for col in df.columns:
        missing_pct = df[col].isna().mean() * 100
        print(f"  {col}: {missing_pct:.1f}% missing")

    # Correlation between missingness indicators
    print("2. Missingness Correlations:")
    print("-" * 40)
    miss_indicators = df.isna().astype(int)
    # Check if missingness in one column predicts another
    for col1 in df.columns:
        for col2 in df.columns:
            if col1 >= col2:
                continue
            if df[col1].isna().sum() == 0 or df[col2].isna().sum() == 0:
                continue
            corr = miss_indicators[col1].corr(miss_indicators[col2])
            if abs(corr) > 0.2:
                print(f"  {col1} <-> {col2}: {corr:.2f}")

    # Does missingness predict target?
    print("3. Target Distribution by Missingness:")
    print("-" * 40)
    for col in df.columns:
        if col == target_col or df[col].isna().sum() == 0:
            continue
        has_value = df[~df[col].isna()][target_col]
        is_missing = df[df[col].isna()][target_col]
        if len(is_missing) > 5:
            # Compare means or proportions
            if df[target_col].dtype in ['int64', 'float64']:
                t_stat, p_val = stats.ttest_ind(has_value, is_missing)
                if p_val < 0.05:
                    print(f"  {col} missing → {target_col} mean: "
                          f"{is_missing.mean():.2f} vs {has_value.mean():.2f} "
                          f"(p={p_val:.3f}) ⚠")

# Example usage
np.random.seed(42)
n = 500

# Simulate dataset with different missing mechanisms
age = np.random.uniform(20, 80, n)
income = 20000 + 1000 * age + np.random.normal(0, 15000, n)
will_buy = (0.01 * income + 0.5 * (age > 50) + np.random.normal(0, 0.5, n)) > 50

# Introduce missingness:
# - Age: MCAR (random 5%)
# - Income: MAR (older people less likely to report)
# - Will_buy: no missing
age_missing = np.random.random(n) < 0.05  # MCAR
income_missing = np.random.random(n) < (0.1 + 0.4 * (age > 60))  # MAR

age[age_missing] = np.nan
income[income_missing] = np.nan

df = pd.DataFrame({
    'age': age,
    'income': income,
    'will_buy': will_buy.astype(int)
})

analyze_missingness(df, 'will_buy')
```

The Simplest Solution: Delete Missing Data
The most straightforward approach to missing data is deletion. There are two variants:
Listwise Deletion (Complete Case Analysis): drop every instance that has any missing value, keeping only fully observed rows.
Pairwise Deletion: for each computation (e.g., a correlation between two features), use every instance where the required values are present, so different statistics are computed on different subsets of the data.
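A minimal sketch of the contrast on a toy table (the column names are arbitrary): pandas' `DataFrame.corr()` uses pairwise-complete observations by default, while chaining `dropna()` first gives the listwise version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'b': [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    'c': [1.0, 2.0, 2.0, np.nan, 5.0, 6.0],
})

# Listwise: drop any row with a missing value anywhere, then correlate.
listwise = df.dropna().corr()
print("Listwise rows used:", len(df.dropna()))  # only 3 of 6 rows survive

# Pairwise: corr() uses all rows where *both* columns in each pair are present.
pairwise = df.corr()
print("Pairwise rows for (a, b):", df[['a', 'b']].dropna().shape[0])  # 4 rows
```

Pairwise deletion keeps more data per statistic, but because each statistic is computed on a different subset, the results can be mutually inconsistent (e.g., a correlation matrix that is not positive semi-definite).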
```python
import numpy as np

def simulate_deletion_impact(n_features: int, miss_rate: float, n_samples: int):
    """
    Simulate the impact of listwise deletion.
    Assumes independent missingness across features.
    """
    # Probability that an instance is complete:
    # P(all features present) = (1 - miss_rate)^n_features
    p_complete = (1 - miss_rate) ** n_features
    expected_remaining = n_samples * p_complete
    expected_lost = n_samples * (1 - p_complete)
    return p_complete, expected_remaining, expected_lost

print("Impact of Listwise Deletion")
print("=" * 60)
print(f"{'Features':<10} | {'Miss Rate':<12} | {'% Remaining':<14} | {'Lost from 1000'}")
print("-" * 60)

for n_feat in [5, 10, 20, 50]:
    for miss_rate in [0.01, 0.05, 0.10]:
        p_complete, remaining, lost = simulate_deletion_impact(n_feat, miss_rate, 1000)
        print(f"{n_feat:<10} | {miss_rate:<12.0%} | {p_complete:<14.1%} | {lost:.0f}")
    print()

print("Key insight: Even 5% per-feature missingness with 20 features")
print("loses ~64% of data through listwise deletion!")
```

Original ID3 essentially uses listwise deletion—instances with missing values for the splitting attribute are excluded. This is why C4.5 and CART developed more sophisticated approaches. If you're using a basic ID3 implementation, be aware of this limitation.
The Fractional Instance Method:
C4.5's approach treats missing values probabilistically by splitting instances across branches according to observed proportions.
During Training:
When computing information gain for attribute $A$: use only the instances where $A$ is known, then scale the resulting gain by the fraction of instances with known $A$, so attributes with many missing values are penalized.
When building child nodes: instances with known $A$ go to their matching branch with full weight; instances with missing $A$ are sent down every branch with a fractional weight proportional to that branch's share of the known instances.
During Prediction:
When classifying an instance with missing feature values: the instance follows all branches with the corresponding weights, and the class distributions at the leaves it reaches are combined by those weights into a final prediction.
```python
import numpy as np
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def compute_branch_weights(x: np.ndarray) -> Dict[str, float]:
    """
    Compute branch weights from known values.
    These weights are used to distribute instances
    with missing values across branches.
    """
    # Filter to known values only
    known_mask = np.array([v is not None and not (isinstance(v, float) and np.isnan(v))
                           for v in x])
    known_values = x[known_mask]
    if len(known_values) == 0:
        return {}
    counts = Counter(known_values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def weighted_entropy(class_weights: Dict[str, float]) -> float:
    """Compute entropy from weighted class counts."""
    total = sum(class_weights.values())
    if total == 0:
        return 0.0
    probs = [w / total for w in class_weights.values() if w > 0]
    return -sum(p * np.log2(p) for p in probs)

def information_gain_fractional(y: np.ndarray, x: np.ndarray,
                                weights: np.ndarray = None) -> float:
    """
    Compute information gain with fractional instances (C4.5 style).

    Parameters:
        y: Class labels
        x: Attribute values (may contain None/NaN for missing)
        weights: Instance weights (default: all 1.0)
    """
    n = len(y)
    if weights is None:
        weights = np.ones(n)

    # Identify known vs missing
    known_mask = np.array([v is not None and not (isinstance(v, float) and np.isnan(v))
                           for v in x])
    total_weight = np.sum(weights)
    known_weight = np.sum(weights[known_mask])
    if known_weight == 0:
        return 0.0

    # Fraction of known values
    fraction_known = known_weight / total_weight

    # Parent entropy (using all instances)
    parent_class_weights = defaultdict(float)
    for yi, wi in zip(y, weights):
        parent_class_weights[yi] += wi
    parent_entropy = weighted_entropy(parent_class_weights)

    # Child entropy (using known values only)
    x_known = x[known_mask]
    y_known = y[known_mask]
    w_known = weights[known_mask]
    branch_weights = compute_branch_weights(x_known)

    child_entropy = 0.0
    for value in branch_weights:
        branch_mask = x_known == value
        branch_total = np.sum(w_known[branch_mask])
        if branch_total > 0:
            branch_class_weights = defaultdict(float)
            for yi, wi in zip(y_known[branch_mask], w_known[branch_mask]):
                branch_class_weights[yi] += wi
            branch_entropy = weighted_entropy(branch_class_weights)
            child_entropy += (branch_total / known_weight) * branch_entropy

    # Information gain, adjusted for fraction known
    ig = parent_entropy - child_entropy
    return ig * fraction_known

# Demonstration
print("Fractional Instances: Information Gain Calculation")
print("=" * 55)

# Dataset with missing values
outlook = np.array(['Sunny', 'Sunny', None, 'Rain', 'Rain', None,
                    'Overcast', 'Sunny', 'Sunny', 'Rain'])
play = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No',
                 'Yes', 'No', 'Yes', 'Yes'])

# Without accounting for missing (complete cases only)
known_mask = np.array([v is not None for v in outlook])
print(f"Total instances: {len(play)}")
print(f"Known outlook values: {np.sum(known_mask)}")
print(f"Missing outlook values: {np.sum(~known_mask)}")

# Compute IG with fractional method
ig = information_gain_fractional(play, outlook)
print(f"Information Gain (fractional): {ig:.4f}")

# Branch weights for missing value distribution
branch_weights = compute_branch_weights(outlook)
print("Branch weights for missing instances:")
for branch, weight in branch_weights.items():
    print(f"  {branch}: {weight:.2%}")
```

When an instance with a missing feature reaches a node during prediction, it is 'split' across all branches. If the Sunny branch has weight 0.4 and predicts 'Yes', while the Rain branch has weight 0.6 and predicts 'No', the weighted prediction favors 'No' with 60% confidence.
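The weighted-vote logic described above can be sketched as follows (the helper name `fractional_predict` and the leaf distributions are illustrative, not C4.5's actual data structures):

```python
# Minimal sketch of fractional prediction: an instance with a missing split
# feature is sent down every branch, and the leaf class distributions are
# combined using the branch weights learned from the training data.
from collections import defaultdict

def fractional_predict(branch_weights, branch_predictions):
    """branch_weights: {branch: weight}; branch_predictions: {branch: {class: prob}}."""
    combined = defaultdict(float)
    for branch, weight in branch_weights.items():
        for cls, prob in branch_predictions[branch].items():
            combined[cls] += weight * prob
    return dict(combined)

# The example from the text: Sunny (weight 0.4) predicts 'Yes',
# Rain (weight 0.6) predicts 'No' -- the weighted vote favors 'No'.
weights = {'Sunny': 0.4, 'Rain': 0.6}
leaves = {'Sunny': {'Yes': 1.0, 'No': 0.0}, 'Rain': {'Yes': 0.0, 'No': 1.0}}
print(fractional_predict(weights, leaves))  # {'Yes': 0.4, 'No': 0.6}
```

Because leaves return full class distributions rather than hard labels, the combined result is itself a probability distribution over classes.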
The Surrogate Philosophy:
CART's surrogate splits represent a fundamentally different philosophy: instead of distributing instances probabilistically, find backup features that make similar decisions.
The insight is that in real data, features are often correlated. If we can't see 'income' for an instance, perhaps 'education' or 'neighborhood' can predict how 'income' would have split.
Mathematical Foundation:
For a primary split that sends fraction $p_L$ left and $p_R$ right, a surrogate split on feature $S$ is evaluated by its agreement with the primary:
$$\text{Agreement}(S) = \frac{\text{instances sent same direction by both}}{\text{total instances}}$$
The Majority Rule Baseline:
A trivial 'surrogate' is to always send instances in the majority direction. If 60% of instances go left, this rule already achieves 60% agreement, so a useful surrogate must beat this baseline.
```python
import numpy as np
from typing import List, Tuple, Dict
from dataclasses import dataclass

@dataclass
class Surrogate:
    """Represents a surrogate split."""
    feature_idx: int
    feature_name: str
    threshold: float
    direction: str      # 'normal' or 'reversed'
    agreement: float
    improvement: float  # over baseline

def find_all_surrogates(X: np.ndarray, primary_feature: int,
                        primary_threshold: float,
                        feature_names: List[str],
                        min_improvement: float = 0.01) -> List[Surrogate]:
    """
    Find all valid surrogate splits for a primary split.

    Parameters:
        X: Feature matrix
        primary_feature: Index of primary split feature
        primary_threshold: Threshold for primary split
        feature_names: Names of features
        min_improvement: Minimum improvement over baseline required

    Returns:
        List of Surrogate objects, sorted by agreement
    """
    n_samples, n_features = X.shape

    # Determine primary split direction; handle missing values in primary
    primary_values = X[:, primary_feature]
    primary_known = ~np.isnan(primary_values)
    if np.sum(primary_known) == 0:
        return []
    primary_left = primary_values <= primary_threshold

    # Baseline: majority direction
    left_fraction = np.mean(primary_left[primary_known])
    baseline = max(left_fraction, 1 - left_fraction)

    surrogates = []
    for feat_idx in range(n_features):
        if feat_idx == primary_feature:
            continue
        feat_values = X[:, feat_idx]
        feat_known = ~np.isnan(feat_values)

        # Need instances where both are known
        both_known = primary_known & feat_known
        if np.sum(both_known) < 10:
            continue

        feat_subset = feat_values[both_known]
        primary_subset = primary_left[both_known]

        # Try all thresholds
        unique_vals = np.unique(feat_subset)
        best_agreement = baseline
        best_threshold = None
        best_direction = None

        for i in range(len(unique_vals) - 1):
            thresh = (unique_vals[i] + unique_vals[i + 1]) / 2
            surrogate_left = feat_subset <= thresh

            # Normal direction
            agree_normal = np.mean(surrogate_left == primary_subset)
            # Reversed direction
            agree_reversed = np.mean((~surrogate_left) == primary_subset)

            if agree_normal >= best_agreement:
                best_agreement = agree_normal
                best_threshold = thresh
                best_direction = 'normal'
            if agree_reversed >= best_agreement:
                best_agreement = agree_reversed
                best_threshold = thresh
                best_direction = 'reversed'

        improvement = best_agreement - baseline
        if improvement >= min_improvement and best_threshold is not None:
            surrogates.append(Surrogate(
                feature_idx=feat_idx,
                feature_name=feature_names[feat_idx],
                threshold=best_threshold,
                direction=best_direction,
                agreement=best_agreement,
                improvement=improvement
            ))

    # Sort by agreement (descending)
    surrogates.sort(key=lambda s: s.agreement, reverse=True)
    return surrogates

# Demonstration with correlated medical data
np.random.seed(42)
n = 500

# Correlated health features
age = np.random.uniform(30, 80, n)
blood_pressure = 80 + 0.8 * age + np.random.normal(0, 10, n)
cholesterol = 150 + 1.2 * age + np.random.normal(0, 25, n)
glucose = 70 + 0.3 * age + np.random.normal(0, 15, n)

# Add some missing values
age_missing = np.random.random(n) < 0.15
blood_pressure[age_missing] = np.nan

X = np.column_stack([age, blood_pressure, cholesterol, glucose])
feature_names = ['Age', 'BloodPressure', 'Cholesterol', 'Glucose']

# Primary split on Age
primary_idx = 0
primary_threshold = 55

print("Surrogate Split Analysis: Medical Data")
print("=" * 60)
print(f"Primary split: Age <= {primary_threshold}")
print()

surrogates = find_all_surrogates(X, primary_idx, primary_threshold, feature_names)

print(f"{'Surrogate':<20} | {'Threshold':<12} | {'Agreement':<12} | {'Improv.'}")
print("-" * 60)
for s in surrogates:
    dir_symbol = '≤' if s.direction == 'normal' else '>'
    print(f"{s.feature_name:<20} | {dir_symbol} {s.threshold:<9.1f} | "
          f"{s.agreement:<12.1%} | {s.improvement:+.1%}")
```

Using Surrogates During Prediction:
```
function predict(instance, node):
    if node is leaf:
        return node.prediction

    split_value = instance[node.split_feature]

    if split_value is missing:
        for surrogate in node.surrogates:
            surrogate_value = instance[surrogate.feature]
            if surrogate_value is not missing:
                if surrogate.direction == 'normal':
                    go_left = surrogate_value <= surrogate.threshold
                else:
                    go_left = surrogate_value > surrogate.threshold
                break
        else:
            # All surrogates missing: use majority
            go_left = node.majority_direction
    else:
        go_left = split_value <= node.threshold

    if go_left:
        return predict(instance, node.left)
    else:
        return predict(instance, node.right)
```
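A runnable Python version of this pseudocode might look like the sketch below (the `Node` layout and field names are assumptions for illustration, not CART's actual data structures):

```python
import math
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class Node:
    # Leaf field (set only on leaves)
    prediction: Optional[str] = None
    # Internal-node fields (hypothetical layout for illustration)
    split_feature: Optional[int] = None
    threshold: float = 0.0
    surrogates: List[Tuple[int, float, str]] = field(default_factory=list)  # (feature, threshold, direction)
    majority_goes_left: bool = True
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def is_missing(v) -> bool:
    return v is None or (isinstance(v, float) and math.isnan(v))

def predict(instance, node: Node):
    if node.prediction is not None:  # leaf
        return node.prediction
    value = instance[node.split_feature]
    if not is_missing(value):
        go_left = value <= node.threshold
    else:
        # Try surrogates in order of agreement; fall back to the majority direction.
        for feat, thresh, direction in node.surrogates:
            sv = instance[feat]
            if not is_missing(sv):
                go_left = sv <= thresh if direction == 'normal' else sv > thresh
                break
        else:
            go_left = node.majority_goes_left
    return predict(instance, node.left if go_left else node.right)

# Tiny tree: split on feature 0 at 55, with feature 1 (threshold 120) as surrogate.
tree = Node(split_feature=0, threshold=55.0,
            surrogates=[(1, 120.0, 'normal')],
            left=Node(prediction='low risk'), right=Node(prediction='high risk'))
print(predict([40.0, 100.0], tree))                 # primary present -> 'low risk'
print(predict([float('nan'), 130.0], tree))         # surrogate routes right -> 'high risk'
print(predict([float('nan'), float('nan')], tree))  # all missing -> majority -> 'low risk'
```

The `for ... else` construct falls through to the majority direction only when no surrogate value is available, mirroring the fallback in the pseudocode.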
Beyond missing values, surrogates reveal feature relationships. If 'education' is a strong surrogate for 'income', they carry overlapping information. This insight is used in CART's variable importance measures and helps interpret feature relationships.
Treating Missingness as Information:
A simple but powerful approach: create an additional binary feature indicating whether the original feature is missing.
For each feature $X_j$: add a binary indicator $M_j = \mathbb{1}[X_j \text{ is missing}]$ as a new feature, then impute the missing values of $X_j$ (e.g., with the median).
The tree can now learn: splits on $X_j$ that use the observed values, and splits on $M_j$ that exploit the missingness pattern itself—under MNAR, the indicator can carry signal that no imputed value could.
```python
import numpy as np
import pandas as pd
from typing import Tuple

def add_missing_indicators(X: np.ndarray,
                           feature_names: list) -> Tuple[np.ndarray, list]:
    """
    Add missing indicator columns and impute missing values.

    Parameters:
        X: Feature matrix with NaN for missing
        feature_names: Original feature names

    Returns:
        Tuple of (augmented feature matrix, new feature names)
    """
    n_samples, n_features = X.shape
    new_features = []
    new_names = []

    for j in range(n_features):
        col = X[:, j].copy()
        missing_mask = np.isnan(col)

        # Add original column (will be imputed)
        if np.sum(~missing_mask) > 0:
            impute_value = np.median(col[~missing_mask])
        else:
            impute_value = 0
        col[missing_mask] = impute_value
        new_features.append(col)
        new_names.append(feature_names[j])

        # Add missing indicator if any missing
        if np.any(missing_mask):
            indicator = missing_mask.astype(float)
            new_features.append(indicator)
            new_names.append(f"{feature_names[j]}_missing")

    return np.column_stack(new_features), new_names

# Demonstration where missingness carries signal
np.random.seed(42)
n = 500

# Scenario: Survey about income
# High-income people more likely to skip income question (MNAR)
true_income = np.random.lognormal(10.5, 0.8, n)

# Probability of skipping increases with income
skip_prob = 0.1 + 0.4 * (true_income > np.percentile(true_income, 70))
skip_mask = np.random.random(n) < skip_prob

observed_income = true_income.copy()
observed_income[skip_mask] = np.nan

# Target: likelihood to purchase luxury item (correlates with true income)
will_purchase = (true_income > np.percentile(true_income, 60)).astype(int)

# Create feature matrix
X = observed_income.reshape(-1, 1)
X_augmented, names = add_missing_indicators(X, ['income'])

print("Missing Indicator Demonstration")
print("=" * 55)
print(f"Total instances: {n}")
print(f"Missing income: {np.sum(skip_mask)} ({np.mean(skip_mask):.1%})")
print()

# Analyze predictive power of missingness
purchase_when_missing = np.mean(will_purchase[skip_mask])
purchase_when_known = np.mean(will_purchase[~skip_mask])

print("Purchase rate by income missingness:")
print(f"  Income missing: {purchase_when_missing:.1%}")
print(f"  Income present: {purchase_when_known:.1%}")
print()
print("The missing indicator carries signal!")
print("High-income people skip the question AND buy luxury items.")
print()
print(f"Augmented features: {names}")
```

| Situation | Recommended Approach |
|---|---|
| <5% missing, MCAR | Deletion or simple imputation |
| 5-20% missing, MAR | Built-in handling (surrogates/fractional) |
| Any %, MNAR | Missing indicators + imputation |
| Correlated features | Surrogates (exploit correlations) |
| Many features missing together | Pattern-based analysis; consider subset modeling |
| Test data will have different missingness | Most robust imputation; test extensively |
You now understand the full landscape of missing value handling in decision trees: mechanisms (MCAR/MAR/MNAR), deletion approaches, fractional instances, surrogate splits, and missing indicators. The final page covers handling categorical features with many values.