Real-world data is full of categorical features: country codes, product categories, user types, industry sectors, color preferences. Unlike continuous features, which have a natural ordering, categorical features present unique challenges for tree algorithms.

The challenges compound with high-cardinality categories, i.e. features with many distinct values. A 'zip code' feature might have 40,000 values. A 'user ID' has millions. A 'product SKU' count varies widely by catalog. How do we split on these efficiently? How do we prevent overfitting? What representations work best?

This page tackles categorical features comprehensively, from fundamental encoding strategies through advanced techniques for high-cardinality features, equipping you to handle any categorical data in practice.
By the end of this page, you will understand: (1) How different tree algorithms handle categorical splits; (2) Encoding strategies and their tradeoffs; (3) The high-cardinality problem and overfitting risks; (4) Target encoding and its regularization; (5) Ordinal features and ordered splits; (6) Best practices for real-world categorical data.
Three Fundamental Approaches:

Different tree algorithms handle categorical features differently:

1. Multi-way Splits (ID3, C4.5)
- Create one branch per category value
- Feature cannot be reused below (each value lands in a separate subtree)
- Natural representation; can fragment data

2. Binary Subset Splits (CART)
- Split categories into two groups: {A, C, E} vs {B, D}
- Feature CAN be reused (with different groupings)
- Exponential search space: $2^{k-1} - 1$ possible splits for $k$ categories (enumerated concretely in the sketch after the table below)

3. Encoded Features (General)
- Convert categorical to numeric representations
- Use standard continuous splits
- Encoding choice significantly affects performance
| Approach | Search Space | Feature Reuse | Tree Depth | Best For |
|---|---|---|---|---|
| Multi-way (C4.5) | O(1) | No | Shallow | Low cardinality (<10) |
| Binary subset (CART) | O(2^k) | Yes | Deeper | Medium cardinality (10-30) |
| Binary + ordering opt. | O(k log k) | Yes | Deeper | Binary classification |
| Encoding | O(k) or O(n) | Depends | Depends | Any cardinality |
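To make the $2^{k-1} - 1$ count concrete, here is a minimal, library-free sketch that enumerates every binary partition of a small category set with `itertools.combinations` (fixing one category on the left side avoids counting each partition twice):

```python
from itertools import combinations

def enumerate_binary_partitions(categories):
    """Yield each unordered binary partition (left, right) of the categories exactly once."""
    cats = sorted(categories)
    anchor = cats[0]          # fix one category on the left to avoid counting complements twice
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {anchor, *extra}
            right = set(cats) - left
            if right:         # skip the trivial split where the right side is empty
                yield left, right

parts = list(enumerate_binary_partitions(['A', 'B', 'C', 'D']))
for left, right in parts:
    print(sorted(left), "vs", sorted(right))
print(f"{len(parts)} partitions = 2^(4-1) - 1 = {2**3 - 1}")
```

For k = 4 this prints 7 partitions, matching the formula; the count doubles with every additional category.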
```python
import numpy as np
from itertools import combinations
from typing import Set, FrozenSet, List, Tuple

def count_binary_partitions(k: int) -> int:
    """
    Count number of non-trivial binary partitions of k items.
    Each partition divides items into two non-empty sets.
    We only count one of each complementary pair.
    """
    # Total subsets: 2^k
    # Subtract empty and full set: 2^k - 2
    # Divide by 2 (complementary pairs): (2^k - 2) / 2 = 2^(k-1) - 1
    return 2**(k-1) - 1

print("Exponential Growth of Binary Partitions")
print("=" * 50)
print(f"{'Categories (k)':<15} | {'Possible Splits':<20}")
print("-" * 50)

for k in [2, 3, 5, 10, 15, 20, 30]:
    splits = count_binary_partitions(k)
    print(f"{k:<15} | {splits:>20,}")

print()
print("At k=30: over 500 million possible splits!")
print("Clearly, exhaustive search is infeasible for high cardinality.")

def optimal_binary_split_binary_class(categories: List[str],
                                      category_positive_rate: dict) -> Tuple[FrozenSet, float]:
    """
    CART's optimization: for binary classification, sort categories by
    positive class rate and find optimal threshold in O(k log k).

    This exploits the fact that the optimal split is always a threshold
    on sorted positive rates.
    """
    # Sort categories by positive rate
    sorted_cats = sorted(categories, key=lambda c: category_positive_rate.get(c, 0))

    # Only need to try k-1 splits (between consecutive sorted categories)
    # instead of 2^(k-1) - 1 splits
    best_left = None
    best_impurity = float('inf')

    for i in range(1, len(sorted_cats)):
        left = frozenset(sorted_cats[:i])
        # Would compute Gini impurity here for each candidate left group.
        # For demo, we just show the size of the search.

    print(f"\nWith {len(categories)} categories:")
    print(f"  Exhaustive search: {count_binary_partitions(len(categories)):,} splits")
    print(f"  Optimized search: {len(categories) - 1} splits (binary class)")

    return frozenset(sorted_cats[:len(sorted_cats)//2]), 0.0

# Demo
cats = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
rates = {c: np.random.random() for c in cats}
optimal_binary_split_binary_class(cats, rates)
```

For binary classification, CART proved that the optimal binary split on a categorical feature can be found in O(k log k) time. Sort categories by P(positive | category), then evaluate only the k-1 threshold splits on this sorted order. This remarkable result makes CART practical for moderate-cardinality categoricals.
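The following self-contained sketch (assuming Gini impurity as the split criterion) compares exhaustive subset search against the sorted-positive-rate shortcut on a small synthetic binary-classification problem; per CART's result, both should report the same minimal impurity while the shortcut evaluates far fewer candidates.

```python
import numpy as np
from itertools import combinations

def weighted_gini(y, left_mask):
    """Weighted Gini impurity of a binary split of the labels y."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels, minlength=2) / len(labels)
        return 1.0 - np.sum(p ** 2)
    n = len(y)
    n_left = left_mask.sum()
    return (n_left / n) * gini(y[left_mask]) + ((n - n_left) / n) * gini(y[~left_mask])

rng = np.random.default_rng(0)
cats = list("ABCDEFGH")
x = rng.choice(cats, size=400)
rates = dict(zip(cats, rng.permutation(np.linspace(0.1, 0.9, len(cats)))))
y = (rng.random(len(x)) < np.array([rates[c] for c in x])).astype(int)

# Exhaustive search: fix 'A' on the left, add every subset of the remaining categories
best_exhaustive = min(
    weighted_gini(y, np.isin(x, ['A', *subset]))
    for r in range(len(cats) - 1)              # 0..k-2 extra left-side categories
    for subset in combinations(cats[1:], r)
)

# CART shortcut: sort categories by observed positive rate, try only k-1 prefix splits
obs_rate = {c: y[x == c].mean() for c in cats}
order = sorted(cats, key=obs_rate.get)
best_sorted = min(
    weighted_gini(y, np.isin(x, order[:i])) for i in range(1, len(cats))
)

print(f"Best Gini (exhaustive, {2**(len(cats)-1) - 1} splits): {best_exhaustive:.4f}")
print(f"Best Gini (sorted shortcut, {len(cats) - 1} splits):  {best_sorted:.4f}")
```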
One-Hot Encoding

Convert each category to a binary column:
- Category 'A' → [1, 0, 0]
- Category 'B' → [0, 1, 0]
- Category 'C' → [0, 0, 1]

For Trees: One-hot encoding with binary splits is equivalent to testing 'is category X?' at each split. This works but:
- Requires k splits to fully partition k categories (inefficient)
- Creates sparse, wide feature matrices
- May lose natural groupings

Label Encoding

Assign arbitrary integers to categories: A→0, B→1, C→2

For Trees: Dangerous! The tree will find thresholds like 'category <= 1.5', grouping A,B vs C based on arbitrary numbering. Only appropriate for ordinal categories with a meaningful order.
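The scratch implementations below show the mechanics; in day-to-day work you would usually reach for library encoders instead. A minimal sketch using scikit-learn's `OneHotEncoder` and `OrdinalEncoder` (assuming a reasonably recent scikit-learn; the toy data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([['red'], ['blue'], ['green'], ['blue'], ['red']])

# One-hot: one binary column per category; unseen test values become all-zero rows
onehot = OneHotEncoder(handle_unknown='ignore')
print(onehot.fit_transform(colors).toarray())
print(onehot.get_feature_names_out())

# Ordinal/label encoding: arbitrary integers -- only safe when the order is meaningful
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(colors))
print(ordinal.categories_)
```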
```python
import numpy as np
from typing import Dict, List

def one_hot_encode(values: np.ndarray) -> tuple:
    """One-hot encode categorical values."""
    categories = np.unique(values)
    n_samples = len(values)
    n_categories = len(categories)

    encoded = np.zeros((n_samples, n_categories))
    cat_to_idx = {cat: i for i, cat in enumerate(categories)}

    for i, val in enumerate(values):
        encoded[i, cat_to_idx[val]] = 1

    column_names = [f"is_{cat}" for cat in categories]
    return encoded, column_names

def label_encode(values: np.ndarray) -> tuple:
    """Label encode categorical values."""
    categories = sorted(np.unique(values))
    cat_to_idx = {cat: i for i, cat in enumerate(categories)}
    encoded = np.array([cat_to_idx[v] for v in values])
    return encoded.reshape(-1, 1), ['label_encoded']

def target_encode(values: np.ndarray, target: np.ndarray,
                  smoothing: float = 10.0) -> tuple:
    """
    Target encode: replace category with smoothed mean target.
    Uses additive smoothing to regularize low-count categories.
    """
    global_mean = np.mean(target)
    categories = np.unique(values)

    cat_stats = {}
    for cat in categories:
        mask = values == cat
        count = np.sum(mask)
        cat_mean = np.mean(target[mask])
        # Smoothed mean: weight toward global mean for low counts
        smoothed = (count * cat_mean + smoothing * global_mean) / (count + smoothing)
        cat_stats[cat] = smoothed

    encoded = np.array([cat_stats[v] for v in values])
    return encoded.reshape(-1, 1), ['target_encoded']

# Comparison demonstration
np.random.seed(42)
n = 1000

# Categorical feature: product category
categories = ['Electronics', 'Books', 'Clothing', 'Food', 'Sports']
product_cat = np.random.choice(categories, n)

# Target: purchase probability varies by category
base_rates = {'Electronics': 0.3, 'Books': 0.5, 'Clothing': 0.4,
              'Food': 0.6, 'Sports': 0.35}
purchase = np.array([np.random.random() < base_rates[c] for c in product_cat])

print("Encoding Comparison: Product Category")
print("=" * 60)

# One-hot
onehot, oh_cols = one_hot_encode(product_cat)
print(f"One-Hot: {onehot.shape[1]} columns")
print(f"  Columns: {oh_cols}")

# Label encoding
label, l_cols = label_encode(product_cat)
print(f"\nLabel Encoding: {label.shape[1]} column")
print(f"  Mapping: {dict(zip(sorted(categories), range(len(categories))))}")
print("  ⚠ Warning: Arbitrary ordering may mislead tree!")

# Target encoding
target_enc, t_cols = target_encode(product_cat, purchase.astype(float))
print(f"\nTarget Encoding: {target_enc.shape[1]} column")
print("  Category encodings (smoothed purchase rates):")
for cat in categories:
    mask = product_cat == cat
    print(f"    {cat}: {np.mean(target_enc[mask]):.3f}")
```

What Constitutes High Cardinality:

- Low: 2-10 categories (gender, rating levels)
- Medium: 10-100 categories (US states, product lines)
- High: 100-10,000 categories (zip codes, merchant IDs)
- Very High: 10,000+ categories (user IDs, product SKUs)

The Problems:

1. Computational: Binary subset search becomes infeasible
2. Overfitting: ID3's bias toward high-cardinality attributes
3. Fragmentation: Each category may have few samples
4. Generalization: Rare categories have unreliable statistics
5. New categories: Test data may have unseen values
```python
import numpy as np
from collections import Counter

def analyze_cardinality_issues(categories: np.ndarray, target: np.ndarray,
                               feature_name: str):
    """Analyze issues with a high-cardinality categorical."""
    print(f"High Cardinality Analysis: {feature_name}")
    print("=" * 60)

    unique_cats = np.unique(categories)
    n_categories = len(unique_cats)
    n_samples = len(categories)

    print(f"Total samples: {n_samples:,}")
    print(f"Unique categories: {n_categories:,}")
    print(f"Average samples per category: {n_samples/n_categories:.1f}")

    # Distribution of category frequencies
    counts = Counter(categories)
    count_values = list(counts.values())

    print(f"\nCategory frequency distribution:")
    print(f"  Min samples: {min(count_values)}")
    print(f"  Max samples: {max(count_values)}")
    print(f"  Median samples: {np.median(count_values):.0f}")

    # Categories with very few samples
    rare_threshold = 5
    rare_cats = sum(1 for c in count_values if c < rare_threshold)
    print(f"  Categories with <{rare_threshold} samples: {rare_cats} "
          f"({rare_cats/n_categories:.1%})")

    # Potential information gain bias
    # With many categories, random splits can achieve high IG
    print(f"\nBinary subset search space: 2^{n_categories-1} - 1 = "
          f"{'HUGE' if n_categories > 30 else f'{2**(n_categories-1)-1:,}'}")

    # Single-sample categories would perfectly classify those samples
    singleton_cats = sum(1 for c in count_values if c == 1)
    if singleton_cats > 0:
        print(f"\n⚠ Singleton categories (1 sample): {singleton_cats}")
        print("  These would overfit to individual samples!")

# Simulate high-cardinality scenario
np.random.seed(42)
n = 10000

# Zip code: ~1000 unique values
# Power law distribution: some zips common, many rare
zip_probs = np.random.power(0.5, 1000)
zip_probs = zip_probs / zip_probs.sum()
zip_codes = np.random.choice([f"ZIP_{i:04d}" for i in range(1000)], n, p=zip_probs)

# Target: somewhat correlated with zip
base_rate = 0.3
zip_effects = {z: np.random.normal(0, 0.1) for z in np.unique(zip_codes)}
target = np.array([
    np.random.random() < base_rate + zip_effects[z]
    for z in zip_codes
]).astype(int)

analyze_cardinality_issues(zip_codes, target, "ZIP Code")
```

Remember ID3's bias: attributes with more values get higher information gain simply because they can partition data more finely. A user ID with 10,000 unique values could achieve near-perfect entropy reduction by creating 10,000 single-sample leaves, which is completely useless for prediction. C4.5's gain ratio and regularization are essential for high-cardinality features.
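To see how C4.5's gain ratio counteracts this bias, the sketch below (a simplified calculation on synthetic data, not C4.5 itself; feature names and rates are illustrative) compares information gain and gain ratio for a genuinely predictive low-cardinality feature against an uninformative ID-like feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_and_ratio(x, y):
    """Information gain and gain ratio for a full multi-way split on feature x."""
    values, counts = np.unique(x, return_counts=True)
    weights = counts / counts.sum()
    children = sum(w * entropy(y[x == v]) for v, w in zip(values, weights))
    gain = entropy(y) - children
    split_info = -np.sum(weights * np.log2(weights))  # penalty for splitting into many branches
    return gain, gain / split_info if split_info > 0 else 0.0

rng = np.random.default_rng(42)
n = 2000
segment = rng.choice(['consumer', 'smb', 'enterprise'], size=n, p=[0.5, 0.3, 0.2])
churn_prob = np.select([segment == 'consumer', segment == 'smb'], [0.8, 0.3], default=0.05)
churn = (rng.random(n) < churn_prob).astype(int)
user_id = np.array([f"user_{i}" for i in range(n)])  # unique per row: zero predictive value

for name, feat in [('segment', segment), ('user_id', user_id)]:
    gain, ratio = info_gain_and_ratio(feat, churn)
    print(f"{name:>8}: information gain = {gain:.3f}  gain ratio = {ratio:.3f}")
```

A purely identifying feature achieves the maximum possible information gain (every leaf is pure), but its gain ratio is divided by a split information of roughly log2(n), which is what lets gain ratio rank the genuinely predictive feature above it.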
Target Encoding Fundamentals:

Replace each category with statistical summaries of the target variable for that category. For classification, use the class probability; for regression, use the mean.

The Critical Challenge: Target Leakage

Naive target encoding uses the target to create features, causing information leakage:
- Training instances 'see' the target through the encoding
- Cross-validation scores become optimistically biased
- Model appears to perform well but fails on truly unseen data

Solutions:

1. Leave-One-Out Encoding: For each instance, compute the encoding using all OTHER instances in that category (see the sketch below)
2. K-Fold Encoding: Compute the encoding using out-of-fold targets only
3. Smoothing: Blend the category mean toward the global mean based on sample size
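The smoothed and K-fold variants are implemented in the code that follows; leave-one-out encoding (solution 1) is not, so here is a minimal sketch under the same conventions (categories and a numeric target as NumPy arrays; singleton categories fall back to the global mean):

```python
import numpy as np
from collections import defaultdict

def target_encode_loo(categories: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Leave-one-out target encoding: each row is encoded from the other rows in its category."""
    global_mean = target.mean()
    cat_sum = defaultdict(float)
    cat_count = defaultdict(int)
    for c, t in zip(categories, target):
        cat_sum[c] += t
        cat_count[c] += 1

    encoded = np.empty(len(categories))
    for i, (c, t) in enumerate(zip(categories, target)):
        if cat_count[c] > 1:
            # Exclude the row's own target so it cannot leak into its feature value
            encoded[i] = (cat_sum[c] - t) / (cat_count[c] - 1)
        else:
            encoded[i] = global_mean  # singleton category: no other rows to borrow from
    return encoded

# Toy usage
cats = np.array(['A', 'A', 'A', 'B', 'B', 'C'])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
print(target_encode_loo(cats, y))
```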
```python
import numpy as np
from typing import Dict, Tuple
from collections import defaultdict

def target_encode_with_smoothing(train_cats: np.ndarray,
                                 train_target: np.ndarray,
                                 test_cats: np.ndarray = None,
                                 smoothing: float = 10.0) -> Tuple[np.ndarray, np.ndarray]:
    """
    Target encoding with additive smoothing (Bayesian shrinkage).

    Smoothed_mean = (n * category_mean + m * global_mean) / (n + m)

    Where m is the smoothing parameter. High m pushes toward global mean.
    """
    global_mean = np.mean(train_target)

    # Compute category statistics
    cat_counts = defaultdict(int)
    cat_sums = defaultdict(float)
    for cat, target in zip(train_cats, train_target):
        cat_counts[cat] += 1
        cat_sums[cat] += target

    # Compute smoothed means
    cat_encoded = {}
    for cat in cat_counts:
        n = cat_counts[cat]
        cat_mean = cat_sums[cat] / n
        smoothed = (n * cat_mean + smoothing * global_mean) / (n + smoothing)
        cat_encoded[cat] = smoothed

    # Encode train
    train_encoded = np.array([cat_encoded[c] for c in train_cats])

    # Encode test (unseen categories get global mean)
    if test_cats is not None:
        test_encoded = np.array([
            cat_encoded.get(c, global_mean) for c in test_cats
        ])
        return train_encoded, test_encoded

    return train_encoded, None

def target_encode_kfold(categories: np.ndarray, target: np.ndarray,
                        n_folds: int = 5,
                        smoothing: float = 10.0) -> np.ndarray:
    """
    K-Fold target encoding to prevent leakage.
    Each instance is encoded using only out-of-fold statistics.
    """
    n = len(categories)
    encoded = np.zeros(n)
    global_mean = np.mean(target)

    # Create fold assignments
    fold_ids = np.tile(np.arange(n_folds), n // n_folds + 1)[:n]
    np.random.shuffle(fold_ids)

    for fold in range(n_folds):
        # Out-of-fold mask
        oof_mask = fold_ids != fold
        in_fold_mask = fold_ids == fold

        # Compute statistics from out-of-fold only
        oof_cats = categories[oof_mask]
        oof_target = target[oof_mask]

        cat_counts = defaultdict(int)
        cat_sums = defaultdict(float)
        for cat, t in zip(oof_cats, oof_target):
            cat_counts[cat] += 1
            cat_sums[cat] += t

        # Encode in-fold instances
        for i in np.where(in_fold_mask)[0]:
            cat = categories[i]
            if cat in cat_counts:
                n_cat = cat_counts[cat]
                cat_mean = cat_sums[cat] / n_cat
                smoothed = (n_cat * cat_mean + smoothing * global_mean) / (n_cat + smoothing)
                encoded[i] = smoothed
            else:
                encoded[i] = global_mean

    return encoded

# Demonstration
np.random.seed(42)
n = 1000

# Categories with varying frequencies
categories = np.concatenate([
    np.repeat('Common_A', 300),
    np.repeat('Common_B', 250),
    np.repeat('Medium_C', 100),
    np.repeat('Medium_D', 80),
    np.repeat('Rare_E', 50),
    np.repeat('Rare_F', 30),
    np.repeat('VeryRare_G', 10),
    np.repeat('VeryRare_H', 5),
    np.array(['Singleton_' + str(i) for i in range(175)])  # 175 singletons
])
np.random.shuffle(categories)

# Target correlated with category
cat_rates = {'Common_A': 0.7, 'Common_B': 0.3, 'Medium_C': 0.5, 'Medium_D': 0.6,
             'Rare_E': 0.8, 'Rare_F': 0.2, 'VeryRare_G': 0.9, 'VeryRare_H': 0.1}
# Singletons: random
for i in range(175):
    cat_rates[f'Singleton_{i}'] = np.random.random()

target = np.array([np.random.random() < cat_rates[c] for c in categories]).astype(float)

print("Target Encoding with Regularization")
print("=" * 55)
print(f"Global mean: {np.mean(target):.3f}")
print()

# Compare naive vs smoothed for rare category
print("Effect of smoothing on 'VeryRare_G' (10 samples):")
for smoothing in [0, 1, 10, 100]:
    enc, _ = target_encode_with_smoothing(categories, target, smoothing=smoothing)
    vrg_mask = categories == 'VeryRare_G'
    print(f"  Smoothing={smoothing:<3}: encoded as {np.mean(enc[vrg_mask]):.3f}")

print()
print("With more smoothing, rare categories move toward global mean (0.5).")
print("This prevents overfitting to small sample statistics.")
```

The smoothing parameter (often called 'm' or 'k') controls regularization strength. A good heuristic: set it to the minimum sample size you'd trust for reliable statistics. m=10 means categories with fewer than 10 samples are heavily regularized toward the global mean.
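A quick way to build intuition for m, assuming the additive-smoothing formula used above: the observed category mean receives weight n/(n + m), so at n = m the category mean and the global mean contribute equally.

```python
# Weight placed on the observed category mean under additive smoothing:
#   weight = n / (n + m)
for m in [1, 10, 100]:
    row = ", ".join(f"n={n}: {n / (n + m):.2f}" for n in [1, 5, 10, 50, 100, 1000])
    print(f"m={m:<3} -> {row}")
```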
Ordinal vs Nominal Categories:

Nominal (no order): color, country, genre
Ordinal (ordered): education level, satisfaction rating, size (S/M/L/XL)

Why Ordinal Treatment Matters:

For ordinal features, only contiguous groupings make sense:
- Valid: {Low, Medium} vs {High} (threshold split)
- Invalid: {Low, High} vs {Medium} (breaks ordering)

This constraint reduces the search space from $2^{k-1} - 1$ to $k - 1$ possible splits, a massive computational saving.

Encoding Ordinal Features:

Simply map to integers respecting order: Small=1, Medium=2, Large=3, XL=4

Unlike nominal label encoding (which is arbitrary), ordinal label encoding is meaningful. The tree can legitimately say 'Size <= 2' (meaning S or M).
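In practice, pandas can carry the order explicitly so the integer codes are guaranteed to respect it. A minimal sketch (assuming pandas is available; the size example mirrors the one above, with codes starting at 0):

```python
import pandas as pd

sizes = pd.Series(['M', 'S', 'XL', 'L', 'M', 'S'])
ordered = pd.Categorical(sizes, categories=['S', 'M', 'L', 'XL'], ordered=True)

# .codes gives integers that respect the declared order: S=0, M=1, L=2, XL=3
print(ordered.codes)
print(ordered < 'L')  # ordered comparisons work directly on the categorical
```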
```python
import numpy as np
from typing import List, Tuple, Optional

def ordinal_encode(values: np.ndarray, order: List[str]) -> np.ndarray:
    """
    Encode ordinal categorical with explicit ordering.

    Parameters:
        values: Categorical values
        order: List of categories in ascending order

    Returns:
        Integer encoded values respecting order
    """
    order_map = {cat: i for i, cat in enumerate(order)}
    return np.array([order_map[v] for v in values])

def find_ordinal_split(x: np.ndarray, y: np.ndarray,
                       order: List[str]) -> Tuple[str, float]:
    """
    Find best split for ordinal feature.
    Only considers threshold splits that respect ordering.
    O(k-1) instead of O(2^k).
    """
    # Encode
    x_encoded = ordinal_encode(x, order)

    best_threshold = None
    best_gini_reduction = -np.inf

    parent_gini = gini_from_counts(np.bincount(y))
    n = len(y)

    # Only k-1 valid splits
    for threshold in range(len(order) - 1):
        left_mask = x_encoded <= threshold
        right_mask = ~left_mask

        n_left = np.sum(left_mask)
        n_right = np.sum(right_mask)

        if n_left == 0 or n_right == 0:
            continue

        left_gini = gini_from_counts(np.bincount(y[left_mask]))
        right_gini = gini_from_counts(np.bincount(y[right_mask]))

        weighted_gini = (n_left/n)*left_gini + (n_right/n)*right_gini
        reduction = parent_gini - weighted_gini

        if reduction > best_gini_reduction:
            best_gini_reduction = reduction
            best_threshold = threshold

    # Return human-readable split
    split_description = (f"{order[best_threshold]} or below vs "
                         f"{order[best_threshold+1]} or above")
    return split_description, best_gini_reduction

def gini_from_counts(counts):
    """Compute Gini impurity from counts."""
    counts = counts[counts > 0]
    if len(counts) == 0:
        return 0
    probs = counts / np.sum(counts)
    return 1 - np.sum(probs**2)

# Demonstration: Education level predicting income
np.random.seed(42)
n = 500

education_order = ['HighSchool', 'SomeCollege', 'Bachelors', 'Masters', 'PhD']
education = np.random.choice(education_order, n, p=[0.3, 0.25, 0.25, 0.15, 0.05])

# Income class: probability increases with education
ed_to_rate = {'HighSchool': 0.2, 'SomeCollege': 0.35, 'Bachelors': 0.6,
              'Masters': 0.75, 'PhD': 0.85}
high_income = np.array([np.random.random() < ed_to_rate[e] for e in education]).astype(int)

print("Ordinal Feature Handling: Education → High Income")
print("=" * 60)

split, reduction = find_ordinal_split(education, high_income, education_order)
print(f"Best ordinal split: {split}")
print(f"Gini reduction: {reduction:.4f}")

print()
print("Ordinal encoding search space:", len(education_order) - 1, "splits")
print("Nominal (binary subset) would be:", 2**(len(education_order)-1) - 1, "splits")
```

Best Practices by Cardinality:

| Cardinality | Algorithm | Recommended Encoding |
|---|---|---|
| 2-10 | Any tree | Native categorical or one-hot |
| 10-50 | CART/XGBoost | Native with regularization |
| 10-50 | Random Forest | One-hot or target encoding |
| 50-500 | Any | Target encoding (K-fold, smoothed) |
| 500+ | Any | Target encoding + feature hashing |
| Ordinal (any) | Any | Integer encoding respecting order |
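The table mentions feature hashing for very high cardinality, which is not demonstrated above, so here is a minimal hand-rolled sketch (using a stable hash from hashlib rather than Python's built-in hash, which varies between runs; the bucket count and example values are illustrative):

```python
import hashlib
import numpy as np

def hash_encode(values, n_buckets: int = 32) -> np.ndarray:
    """Map each category string to one of n_buckets columns via a stable hash."""
    encoded = np.zeros((len(values), n_buckets))
    for i, v in enumerate(values):
        bucket = int(hashlib.md5(v.encode()).hexdigest(), 16) % n_buckets
        encoded[i, bucket] = 1  # collisions are accepted as the price of a fixed width
    return encoded

zip_codes = ["ZIP_0012", "ZIP_9917", "ZIP_0305", "ZIP_0012", "ZIP_4001"]
X = hash_encode(zip_codes, n_buckets=16)
print(X.shape)            # fixed width regardless of how many distinct zips exist
print(X.argmax(axis=1))   # identical categories land in the same bucket
```

Unseen categories at test time hash to some bucket automatically, which is one reason hashing pairs well with very high cardinality and with target encoding as a complementary feature.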
You've completed the Tree Growing Algorithms module! You now understand ID3, C4.5, and CART algorithms; how to handle missing values through deletion, fractional instances, and surrogates; and how to manage categorical features from low to very high cardinality. These skills prepare you for practical decision tree applications and ensemble methods.