In the clean world of textbook machine learning, every dataset is complete—every cell filled, every feature known. Reality is messier. Real-world datasets routinely have 10-30% missing values, and some domains (healthcare, surveys, sensor data) can have 50% or more.
Missing data isn't just an inconvenience—it's information. The pattern of missingness often carries signal about the underlying process. A patient who skips a medical test might be avoiding bad news. A sensor that stops reporting might indicate equipment failure. A survey respondent who leaves income blank might have extreme (high or low) income.
Decision trees have inherent advantages for missing data. Unlike methods requiring complete feature vectors, trees can route instances through splits using only available features. But leveraging this advantage requires understanding the different approaches and their tradeoffs.
By the end of this page, you will understand: (1) Types of missing data mechanisms and their implications; (2) Three main approaches: deletion, fractional instances, and surrogates; (3) How to compute information gain with missing values; (4) Prediction strategies when test instances have missing features; (5) Best practices for handling missingness in tree-based models.
Rubin's Classification of Missing Data:
Statistician Donald Rubin established the foundational framework for understanding missing data in 1976. Understanding these mechanisms determines which approaches are valid:
1. Missing Completely At Random (MCAR): missingness is independent of both observed and unobserved values (e.g., a sensor drops readings at random).
2. Missing At Random (MAR): missingness depends only on observed variables (e.g., older respondents skip a question, and age is recorded).
3. Missing Not At Random (MNAR): missingness depends on the unobserved value itself (e.g., high earners skip the income question because of their income).
| Mechanism | Pattern | Simple Deletion | Tree Built-in Methods |
|---|---|---|---|
| MCAR | Completely random | ✓ Valid (inefficient) | ✓ Valid |
| MAR | Predictable from other features | ⚠ May be valid | ✓ Valid (surrogates leverage this) |
| MNAR | Depends on missing value | ✗ Biased | ⚠ May still be biased |
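The bias risk in the MNAR row is easy to see in a small simulation (the missingness probabilities below are made up for illustration): under MCAR the complete-case mean stays close to the truth, while under MNAR, where larger values are more likely to go missing, it is biased low.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(100, 15, 10_000)

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(true_values.size) < 0.30

# MNAR: the probability of being missing grows with the value itself
mnar_mask = rng.random(true_values.size) < np.clip((true_values - 100) / 50 + 0.30, 0, 1)

print(f"True mean:          {true_values.mean():.1f}")
print(f"Complete-case MCAR: {true_values[~mcar_mask].mean():.1f}")  # close to the true mean
print(f"Complete-case MNAR: {true_values[~mnar_mask].mean():.1f}")  # biased low
```

Deleting incomplete cases under MCAR only costs sample size; under MNAR it shifts the estimate itself.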
You can't definitively determine the mechanism from data alone (it depends on unobserved values). However, you can investigate: (1) Is missingness correlated with other variables? (2) Do complete cases differ systematically from incomplete cases? (3) Does domain knowledge suggest why values are missing?
```python
import numpy as np
import pandas as pd
from scipy import stats

def analyze_missingness(df: pd.DataFrame, target_col: str):
    """
    Analyze missing data patterns to diagnose mechanism.
    """
    print("Missing Data Analysis")
    print("=" * 60)

    # Overall missingness
    print("1. Missingness by Column:")
    print("-" * 40)
    for col in df.columns:
        missing_pct = df[col].isna().mean() * 100
        print(f"  {col}: {missing_pct:.1f}% missing")

    # Correlation between missingness indicators
    print("2. Missingness Correlations:")
    print("-" * 40)
    miss_indicators = df.isna().astype(int)
    # Check if missingness in one column predicts another
    for col1 in df.columns:
        for col2 in df.columns:
            if col1 >= col2:
                continue
            if df[col1].isna().sum() == 0 or df[col2].isna().sum() == 0:
                continue
            corr = miss_indicators[col1].corr(miss_indicators[col2])
            if abs(corr) > 0.2:
                print(f"  {col1} <-> {col2}: {corr:.2f}")

    # Does missingness predict target?
    print("3. Target Distribution by Missingness:")
    print("-" * 40)
    for col in df.columns:
        if col == target_col or df[col].isna().sum() == 0:
            continue
        has_value = df[~df[col].isna()][target_col]
        is_missing = df[df[col].isna()][target_col]
        if len(is_missing) > 5:
            # Compare means or proportions
            if df[target_col].dtype in ['int64', 'float64']:
                t_stat, p_val = stats.ttest_ind(has_value, is_missing)
                if p_val < 0.05:
                    print(f"  {col} missing → {target_col} mean: "
                          f"{is_missing.mean():.2f} vs {has_value.mean():.2f} "
                          f"(p={p_val:.3f}) ⚠")

# Example usage
np.random.seed(42)
n = 500

# Simulate dataset with different missing mechanisms
age = np.random.uniform(20, 80, n)
income = 20000 + 1000 * age + np.random.normal(0, 15000, n)
will_buy = (0.01 * income + 0.5 * (age > 50) + np.random.normal(0, 0.5, n)) > 50

# Introduce missingness:
# - Age: MCAR (random 5%)
# - Income: MAR (older people less likely to report)
# - Will_buy: no missing
age_missing = np.random.random(n) < 0.05  # MCAR
income_missing = np.random.random(n) < (0.1 + 0.4 * (age > 60))  # MAR

age[age_missing] = np.nan
income[income_missing] = np.nan

df = pd.DataFrame({
    'age': age,
    'income': income,
    'will_buy': will_buy.astype(int)
})

analyze_missingness(df, 'will_buy')
```

The Simplest Solution: Delete Missing Data
The most straightforward approach to missing data is deletion. There are two variants:
Listwise Deletion (Complete Case Analysis): drop every instance that has any missing value, keeping only fully observed rows.
Pairwise Deletion: for each computation (e.g., a correlation between two features), use every instance where the required values are present, so different statistics are computed on different subsets of the data.
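A minimal sketch of the contrast on a toy table (the column names are arbitrary): pandas' `DataFrame.corr()` uses pairwise-complete observations by default, while chaining `dropna()` first gives the listwise version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'b': [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    'c': [1.0, 2.0, 2.0, np.nan, 5.0, 6.0],
})

# Listwise: drop any row with a missing value anywhere, then correlate.
listwise = df.dropna().corr()
print("Listwise rows used:", len(df.dropna()))  # only 3 of 6 rows survive

# Pairwise: corr() uses all rows where *both* columns in each pair are present.
pairwise = df.corr()
print("Pairwise rows for (a, b):", df[['a', 'b']].dropna().shape[0])  # 4 rows
```

Pairwise deletion keeps more data per statistic, but because each statistic is computed on a different subset, the results can be mutually inconsistent (e.g., a correlation matrix that is not positive semi-definite).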
```python
import numpy as np

def simulate_deletion_impact(n_features: int, miss_rate: float, n_samples: int):
    """
    Simulate the impact of listwise deletion.
    Assumes independent missingness across features.
    """
    # Probability that an instance is complete:
    # P(all features present) = (1 - miss_rate)^n_features
    p_complete = (1 - miss_rate) ** n_features
    expected_remaining = n_samples * p_complete
    expected_lost = n_samples * (1 - p_complete)
    return p_complete, expected_remaining, expected_lost

print("Impact of Listwise Deletion")
print("=" * 60)
print(f"{'Features':<10} | {'Miss Rate':<12} | {'% Remaining':<14} | {'Lost from 1000'}")
print("-" * 60)

for n_feat in [5, 10, 20, 50]:
    for miss_rate in [0.01, 0.05, 0.10]:
        p_complete, remaining, lost = simulate_deletion_impact(n_feat, miss_rate, 1000)
        print(f"{n_feat:<10} | {miss_rate:<12.0%} | {p_complete:<14.1%} | {lost:.0f}")
    print()

print("Key insight: Even 5% per-feature missingness with 20 features")
print("loses ~64% of data through listwise deletion!")
```

Original ID3 essentially uses listwise deletion—instances with missing values for the splitting attribute are excluded. This is why C4.5 and CART developed more sophisticated approaches. If you're using a basic ID3 implementation, be aware of this limitation.
The Fractional Instance Method:
C4.5's approach treats missing values probabilistically by splitting instances across branches according to observed proportions.
During Training:
When computing information gain for attribute $A$: use only the instances where $A$ is known, then scale the resulting gain by the fraction of instances with known $A$, so attributes with many missing values are penalized.
When building child nodes: instances with known $A$ go to their matching branch with full weight; instances with missing $A$ are sent down every branch with a fractional weight proportional to that branch's share of the known instances.
During Prediction:
When classifying an instance with missing feature values: the instance follows all branches with the corresponding weights, and the class distributions at the leaves it reaches are combined by those weights into a final prediction.
```python
import numpy as np
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def compute_branch_weights(x: np.ndarray) -> Dict[str, float]:
    """
    Compute branch weights from known values.
    These weights are used to distribute instances
    with missing values across branches.
    """
    # Filter to known values only
    known_mask = np.array([v is not None and not (isinstance(v, float) and np.isnan(v))
                           for v in x])
    known_values = x[known_mask]
    if len(known_values) == 0:
        return {}
    counts = Counter(known_values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def weighted_entropy(class_weights: Dict[str, float]) -> float:
    """Compute entropy from weighted class counts."""
    total = sum(class_weights.values())
    if total == 0:
        return 0.0
    probs = [w / total for w in class_weights.values() if w > 0]
    return -sum(p * np.log2(p) for p in probs)

def information_gain_fractional(y: np.ndarray, x: np.ndarray,
                                weights: np.ndarray = None) -> float:
    """
    Compute information gain with fractional instances (C4.5 style).

    Parameters:
        y: Class labels
        x: Attribute values (may contain None/NaN for missing)
        weights: Instance weights (default: all 1.0)
    """
    n = len(y)
    if weights is None:
        weights = np.ones(n)

    # Identify known vs missing
    known_mask = np.array([v is not None and not (isinstance(v, float) and np.isnan(v))
                           for v in x])
    total_weight = np.sum(weights)
    known_weight = np.sum(weights[known_mask])
    if known_weight == 0:
        return 0.0

    # Fraction of known values
    fraction_known = known_weight / total_weight

    # Parent entropy (using all instances)
    parent_class_weights = defaultdict(float)
    for yi, wi in zip(y, weights):
        parent_class_weights[yi] += wi
    parent_entropy = weighted_entropy(parent_class_weights)

    # Child entropy (using known values only)
    x_known = x[known_mask]
    y_known = y[known_mask]
    w_known = weights[known_mask]
    branch_weights = compute_branch_weights(x_known)

    child_entropy = 0.0
    for value in branch_weights:
        branch_mask = x_known == value
        branch_total = np.sum(w_known[branch_mask])
        if branch_total > 0:
            branch_class_weights = defaultdict(float)
            for yi, wi in zip(y_known[branch_mask], w_known[branch_mask]):
                branch_class_weights[yi] += wi
            branch_entropy = weighted_entropy(branch_class_weights)
            child_entropy += (branch_total / known_weight) * branch_entropy

    # Information gain, adjusted for fraction known
    ig = parent_entropy - child_entropy
    return ig * fraction_known

# Demonstration
print("Fractional Instances: Information Gain Calculation")
print("=" * 55)

# Dataset with missing values
outlook = np.array(['Sunny', 'Sunny', None, 'Rain', 'Rain', None,
                    'Overcast', 'Sunny', 'Sunny', 'Rain'])
play = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No',
                 'Yes', 'No', 'Yes', 'Yes'])

# Without accounting for missing (complete cases only)
known_mask = np.array([v is not None for v in outlook])
print(f"Total instances: {len(play)}")
print(f"Known outlook values: {np.sum(known_mask)}")
print(f"Missing outlook values: {np.sum(~known_mask)}")

# Compute IG with fractional method
ig = information_gain_fractional(play, outlook)
print(f"Information Gain (fractional): {ig:.4f}")

# Branch weights for missing value distribution
branch_weights = compute_branch_weights(outlook)
print("Branch weights for missing instances:")
for branch, weight in branch_weights.items():
    print(f"  {branch}: {weight:.2%}")
```

When an instance with a missing feature reaches a node during prediction, it is 'split' across all branches. If the Sunny branch has weight 0.4 and predicts 'Yes', while the Rain branch has weight 0.6 and predicts 'No', the weighted prediction favors 'No' with 60% confidence.
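The weighted-vote logic described above can be sketched as follows (the helper name `fractional_predict` and the leaf distributions are illustrative, not C4.5's actual data structures):

```python
# Minimal sketch of fractional prediction: an instance with a missing split
# feature is sent down every branch, and the leaf class distributions are
# combined using the branch weights learned from the training data.
from collections import defaultdict

def fractional_predict(branch_weights, branch_predictions):
    """branch_weights: {branch: weight}; branch_predictions: {branch: {class: prob}}."""
    combined = defaultdict(float)
    for branch, weight in branch_weights.items():
        for cls, prob in branch_predictions[branch].items():
            combined[cls] += weight * prob
    return dict(combined)

# The example from the text: Sunny (weight 0.4) predicts 'Yes',
# Rain (weight 0.6) predicts 'No' -- the weighted vote favors 'No'.
weights = {'Sunny': 0.4, 'Rain': 0.6}
leaves = {'Sunny': {'Yes': 1.0, 'No': 0.0}, 'Rain': {'Yes': 0.0, 'No': 1.0}}
print(fractional_predict(weights, leaves))  # {'Yes': 0.4, 'No': 0.6}
```

Because leaves return full class distributions rather than hard labels, the combined result is itself a probability distribution over classes.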
The Surrogate Philosophy:
CART's surrogate splits represent a fundamentally different philosophy: instead of distributing instances probabilistically, find backup features that make similar decisions.
The insight is that in real data, features are often correlated. If we can't see 'income' for an instance, perhaps 'education' or 'neighborhood' can predict how 'income' would have split.
Mathematical Foundation:
For a primary split that sends fraction $p_L$ left and $p_R$ right, a surrogate split on feature $S$ is evaluated by its agreement with the primary:
$$\text{Agreement}(S) = \frac{\text{instances sent same direction by both}}{\text{total instances}}$$
The Majority Rule Baseline:
A trivial 'surrogate' is to always send instances in the majority direction. If 60% of instances go left, this rule already achieves 60% agreement, so a useful surrogate must beat this baseline.
```python
import numpy as np
from typing import List, Tuple, Dict
from dataclasses import dataclass

@dataclass
class Surrogate:
    """Represents a surrogate split."""
    feature_idx: int
    feature_name: str
    threshold: float
    direction: str      # 'normal' or 'reversed'
    agreement: float
    improvement: float  # over baseline

def find_all_surrogates(X: np.ndarray, primary_feature: int,
                        primary_threshold: float,
                        feature_names: List[str],
                        min_improvement: float = 0.01) -> List[Surrogate]:
    """
    Find all valid surrogate splits for a primary split.

    Parameters:
        X: Feature matrix
        primary_feature: Index of primary split feature
        primary_threshold: Threshold for primary split
        feature_names: Names of features
        min_improvement: Minimum improvement over baseline required

    Returns:
        List of Surrogate objects, sorted by agreement
    """
    n_samples, n_features = X.shape

    # Determine primary split direction; handle missing values in primary
    primary_values = X[:, primary_feature]
    primary_known = ~np.isnan(primary_values)
    if np.sum(primary_known) == 0:
        return []
    primary_left = primary_values <= primary_threshold

    # Baseline: majority direction
    left_fraction = np.mean(primary_left[primary_known])
    baseline = max(left_fraction, 1 - left_fraction)

    surrogates = []
    for feat_idx in range(n_features):
        if feat_idx == primary_feature:
            continue
        feat_values = X[:, feat_idx]
        feat_known = ~np.isnan(feat_values)

        # Need instances where both are known
        both_known = primary_known & feat_known
        if np.sum(both_known) < 10:
            continue

        feat_subset = feat_values[both_known]
        primary_subset = primary_left[both_known]

        # Try all thresholds
        unique_vals = np.unique(feat_subset)
        best_agreement = baseline
        best_threshold = None
        best_direction = None

        for i in range(len(unique_vals) - 1):
            thresh = (unique_vals[i] + unique_vals[i + 1]) / 2
            surrogate_left = feat_subset <= thresh

            # Normal direction
            agree_normal = np.mean(surrogate_left == primary_subset)
            # Reversed direction
            agree_reversed = np.mean((~surrogate_left) == primary_subset)

            if agree_normal >= best_agreement:
                best_agreement = agree_normal
                best_threshold = thresh
                best_direction = 'normal'
            if agree_reversed >= best_agreement:
                best_agreement = agree_reversed
                best_threshold = thresh
                best_direction = 'reversed'

        improvement = best_agreement - baseline
        if improvement >= min_improvement and best_threshold is not None:
            surrogates.append(Surrogate(
                feature_idx=feat_idx,
                feature_name=feature_names[feat_idx],
                threshold=best_threshold,
                direction=best_direction,
                agreement=best_agreement,
                improvement=improvement
            ))

    # Sort by agreement (descending)
    surrogates.sort(key=lambda s: s.agreement, reverse=True)
    return surrogates

# Demonstration with correlated medical data
np.random.seed(42)
n = 500

# Correlated health features
age = np.random.uniform(30, 80, n)
blood_pressure = 80 + 0.8 * age + np.random.normal(0, 10, n)
cholesterol = 150 + 1.2 * age + np.random.normal(0, 25, n)
glucose = 70 + 0.3 * age + np.random.normal(0, 15, n)

# Add some missing values
age_missing = np.random.random(n) < 0.15
blood_pressure[age_missing] = np.nan

X = np.column_stack([age, blood_pressure, cholesterol, glucose])
feature_names = ['Age', 'BloodPressure', 'Cholesterol', 'Glucose']

# Primary split on Age
primary_idx = 0
primary_threshold = 55

print("Surrogate Split Analysis: Medical Data")
print("=" * 60)
print(f"Primary split: Age <= {primary_threshold}")
print()

surrogates = find_all_surrogates(X, primary_idx, primary_threshold, feature_names)

print(f"{'Surrogate':<20} | {'Threshold':<12} | {'Agreement':<12} | {'Improv.'}")
print("-" * 60)
for s in surrogates:
    dir_symbol = '≤' if s.direction == 'normal' else '>'
    print(f"{s.feature_name:<20} | {dir_symbol} {s.threshold:<9.1f} | "
          f"{s.agreement:<12.1%} | {s.improvement:+.1%}")
```

Using Surrogates During Prediction:
```
function predict(instance, node):
    if node is leaf:
        return node.prediction

    split_value = instance[node.split_feature]

    if split_value is missing:
        for surrogate in node.surrogates:
            surrogate_value = instance[surrogate.feature]
            if surrogate_value is not missing:
                if surrogate.direction == 'normal':
                    go_left = surrogate_value <= surrogate.threshold
                else:
                    go_left = surrogate_value > surrogate.threshold
                break
        else:
            # All surrogates missing: use majority
            go_left = node.majority_direction
    else:
        go_left = split_value <= node.threshold

    if go_left:
        return predict(instance, node.left)
    else:
        return predict(instance, node.right)
```
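A runnable Python version of this pseudocode might look like the sketch below (the `Node` layout and field names are assumptions for illustration, not CART's actual data structures):

```python
import math
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class Node:
    # Leaf field (set only on leaves)
    prediction: Optional[str] = None
    # Internal-node fields (hypothetical layout for illustration)
    split_feature: Optional[int] = None
    threshold: float = 0.0
    surrogates: List[Tuple[int, float, str]] = field(default_factory=list)  # (feature, threshold, direction)
    majority_goes_left: bool = True
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def is_missing(v) -> bool:
    return v is None or (isinstance(v, float) and math.isnan(v))

def predict(instance, node: Node):
    if node.prediction is not None:  # leaf
        return node.prediction
    value = instance[node.split_feature]
    if not is_missing(value):
        go_left = value <= node.threshold
    else:
        # Try surrogates in order of agreement; fall back to the majority direction.
        for feat, thresh, direction in node.surrogates:
            sv = instance[feat]
            if not is_missing(sv):
                go_left = sv <= thresh if direction == 'normal' else sv > thresh
                break
        else:
            go_left = node.majority_goes_left
    return predict(instance, node.left if go_left else node.right)

# Tiny tree: split on feature 0 at 55, with feature 1 (threshold 120) as surrogate.
tree = Node(split_feature=0, threshold=55.0,
            surrogates=[(1, 120.0, 'normal')],
            left=Node(prediction='low risk'), right=Node(prediction='high risk'))
print(predict([40.0, 100.0], tree))                 # primary present -> 'low risk'
print(predict([float('nan'), 130.0], tree))         # surrogate routes right -> 'high risk'
print(predict([float('nan'), float('nan')], tree))  # all missing -> majority -> 'low risk'
```

The `for ... else` construct falls through to the majority direction only when no surrogate value is available, mirroring the fallback in the pseudocode.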
Beyond missing values, surrogates reveal feature relationships. If 'education' is a strong surrogate for 'income', they carry overlapping information. This insight is used in CART's variable importance measures and helps interpret feature relationships.
Treating Missingness as Information:
A simple but powerful approach: create an additional binary feature indicating whether the original feature is missing.
For each feature $X_j$: add a binary indicator $M_j = \mathbb{1}[X_j \text{ is missing}]$ as a new feature, then impute the missing values of $X_j$ (e.g., with the median).
The tree can now learn: splits on $X_j$ that use the observed values, and splits on $M_j$ that exploit the missingness pattern itself—under MNAR, the indicator can carry signal that no imputed value could.
```python
import numpy as np
import pandas as pd
from typing import Tuple

def add_missing_indicators(X: np.ndarray,
                           feature_names: list) -> Tuple[np.ndarray, list]:
    """
    Add missing indicator columns and impute missing values.

    Parameters:
        X: Feature matrix with NaN for missing
        feature_names: Original feature names

    Returns:
        Tuple of (augmented feature matrix, new feature names)
    """
    n_samples, n_features = X.shape
    new_features = []
    new_names = []

    for j in range(n_features):
        col = X[:, j].copy()
        missing_mask = np.isnan(col)

        # Add original column (will be imputed)
        if np.sum(~missing_mask) > 0:
            impute_value = np.median(col[~missing_mask])
        else:
            impute_value = 0
        col[missing_mask] = impute_value
        new_features.append(col)
        new_names.append(feature_names[j])

        # Add missing indicator if any missing
        if np.any(missing_mask):
            indicator = missing_mask.astype(float)
            new_features.append(indicator)
            new_names.append(f"{feature_names[j]}_missing")

    return np.column_stack(new_features), new_names

# Demonstration where missingness carries signal
np.random.seed(42)
n = 500

# Scenario: Survey about income
# High-income people more likely to skip income question (MNAR)
true_income = np.random.lognormal(10.5, 0.8, n)

# Probability of skipping increases with income
skip_prob = 0.1 + 0.4 * (true_income > np.percentile(true_income, 70))
skip_mask = np.random.random(n) < skip_prob

observed_income = true_income.copy()
observed_income[skip_mask] = np.nan

# Target: likelihood to purchase luxury item (correlates with true income)
will_purchase = (true_income > np.percentile(true_income, 60)).astype(int)

# Create feature matrix
X = observed_income.reshape(-1, 1)
X_augmented, names = add_missing_indicators(X, ['income'])

print("Missing Indicator Demonstration")
print("=" * 55)
print(f"Total instances: {n}")
print(f"Missing income: {np.sum(skip_mask)} ({np.mean(skip_mask):.1%})")
print()

# Analyze predictive power of missingness
purchase_when_missing = np.mean(will_purchase[skip_mask])
purchase_when_known = np.mean(will_purchase[~skip_mask])

print("Purchase rate by income missingness:")
print(f"  Income missing: {purchase_when_missing:.1%}")
print(f"  Income present: {purchase_when_known:.1%}")
print()
print("The missing indicator carries signal!")
print("High-income people skip the question AND buy luxury items.")
print()
print(f"Augmented features: {names}")
```

| Situation | Recommended Approach |
|---|---|
| <5% missing, MCAR | Deletion or simple imputation |
| 5-20% missing, MAR | Built-in handling (surrogates/fractional) |
| Any %, MNAR | Missing indicators + imputation |
| Correlated features | Surrogates (exploit correlations) |
| Many features missing together | Pattern-based analysis; consider subset modeling |
| Test data will have different missingness | Most robust imputation; test extensively |
You now understand the full landscape of missing value handling in decision trees: mechanisms (MCAR/MAR/MNAR), deletion approaches, fractional instances, surrogate splits, and missing indicators. The final page covers handling categorical features with many values.