Categorical features are ubiquitous in real-world datasets: product categories, geographic regions, user segments, device types, and countless domain-specific classifications. Yet machine learning algorithms fundamentally operate on numerical inputs, creating a representation gap that practitioners must bridge.
For gradient boosting specifically, categorical handling is both critically important and surprisingly nuanced. Different boosting libraries implement different categorical handling strategies, each with distinct tradeoffs. Choosing the wrong approach can dramatically hurt model performance or training efficiency.
By the end of this page, you will understand the full landscape of categorical encoding methods, how XGBoost, LightGBM, and CatBoost each handle categoricals natively, when to use which encoding strategy, and how to handle high-cardinality and hierarchical categoricals effectively.
| Method | Cardinality | Pros | Cons |
|---|---|---|---|
| One-Hot | Low (<20) | No information loss, interpretable | Dimensionality explosion, sparse |
| Label/Ordinal | Any | Single column, efficient | Imposes false ordering |
| Target Encoding | High | Captures predictive signal | Requires regularization (leakage risk) |
| Frequency Encoding | High | No leakage, captures popularity | No target signal |
| Binary Encoding | Medium | Compact (log₂k columns) | Loses some information |
| Hash Encoding | Very High | Fixed dimensionality | Collisions, not invertible |
| Embedding | Very High | Learned representations | Requires neural network training |
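To make a couple of the rows above concrete, here is a minimal sketch of frequency encoding in plain pandas and binary encoding via the third-party category_encoders package (the toy DataFrame and column name are illustrative):
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'SF', 'NYC', 'LA']})

# Frequency encoding: replace each category with how often it appears
df['city_freq'] = df['city'].map(df['city'].value_counts())

# Binary encoding (requires the category_encoders package):
# each category id is written in binary across roughly log2(k) columns
import category_encoders as ce
df_binary = ce.BinaryEncoder(cols=['city']).fit_transform(df[['city']])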
Ordinal vs. Nominal Categoricals:
Not all categorical features are the same: ordinal categories have a natural order (such as education level), while nominal categories do not (such as city or device type).
For ordinal variables, label encoding respects the underlying order:
education_order = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
df['education_encoded'] = df['education'].map(education_order)
For nominal variables, label encoding imposes a false ordering that can mislead the model. One-hot or specialized encodings are preferred.
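As a minimal sketch (with a hypothetical 'device' column), one-hot encoding in pandas looks like this:
import pandas as pd

df = pd.DataFrame({'device': ['mobile', 'desktop', 'tablet', 'mobile']})

# One-hot: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(df['device'], prefix='device')
df = pd.concat([df.drop(columns='device'), one_hot], axis=1)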
Modern boosting libraries provide native categorical handling that often outperforms manual encoding.
LightGBM Native Categoricals:
LightGBM searches for the best partition of categories into two groups without enumerating the exponentially many possible partitions. Instead it uses a sorting-based algorithm: categories are sorted by their accumulated gradient statistics (sum of gradients over sum of Hessians), and the best split is found by scanning split points along that sorted order:
import lightgbm as lgb
# Specify categorical features by name when fitting (sklearn API);
# columns with pandas 'category' dtype are also detected automatically
model = lgb.LGBMClassifier()
model.fit(
    X_train, y_train,
    categorical_feature=['city', 'product_type', 'device']
)
# Or using the Dataset API
train_data = lgb.Dataset(
    data=X_train, label=y_train,
    categorical_feature=['city', 'product_type']
)
CatBoost Native Categoricals:
CatBoost uses ordered target statistics (covered in target encoding page) computed dynamically during training:
import catboost as cb
model = cb.CatBoostClassifier(
cat_features=['city', 'product_type', 'device'],
one_hot_max_size=10 # One-hot for low cardinality
)
# CatBoost handles encoding automatically
model.fit(X_train, y_train)
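A toy sketch of the ordered-target-statistic idea follows (an illustration only; CatBoost averages over several random permutations and applies this inside its training loop). Each row's encoding uses only the rows that precede it in a random permutation, smoothed toward the global prior:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'NYC', 'LA', 'SF'],
    'y':    [1,     0,    1,     0,     1,    0],
})
prior, a = df['y'].mean(), 1.0  # smoothing toward the global prior

perm = np.random.default_rng(42).permutation(len(df))
encoded = pd.Series(index=df.index, dtype=float)
seen_sum, seen_cnt = {}, {}
for idx in perm:
    cat = df.loc[idx, 'city']
    encoded.loc[idx] = (seen_sum.get(cat, 0.0) + a * prior) / (seen_cnt.get(cat, 0) + a)
    seen_sum[cat] = seen_sum.get(cat, 0.0) + df.loc[idx, 'y']
    seen_cnt[cat] = seen_cnt.get(cat, 0) + 1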
XGBoost Native Categoricals (v1.5+):
XGBoost added experimental categorical support in version 1.5, with optimal partitioning of categories added in later releases; it requires a histogram-based tree method:
import xgboost as xgb
# Enable categorical feature support
model = xgb.XGBClassifier(
tree_method='hist',
enable_categorical=True
)
# Ensure columns are pandas Categorical dtype
X_train['city'] = X_train['city'].astype('category')
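Continuing the snippet above (X_train, y_train, and model are the objects already defined), fitting then works directly on the DataFrame, and the lower-level DMatrix interface accepts the same flag:
# Fit on the DataFrame once its categorical columns use the 'category' dtype
model.fit(X_train, y_train)

# The native (non-sklearn) interface takes the same flag on the DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)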
Use native handling when the number of categories is roughly between 10 and 1,000, you want the library to search for optimal splits, and you are using LightGBM or CatBoost. Use manual encoding when cardinality exceeds about 1,000 (hash or target encoding), you need reproducible features across different libraries, or you are injecting domain knowledge through your encoding choices.
High-cardinality features (1000+ categories) require special handling. One-hot encoding is infeasible, and even native categorical support may struggle.
Strategy 1: Grouping/Binning
Reduce cardinality by grouping rare or similar categories:
# Group by frequency threshold
value_counts = df['category'].value_counts()
top_categories = value_counts[value_counts >= 100].index
df['category_grouped'] = df['category'].where(
df['category'].isin(top_categories), 'OTHER'
)
# Group by hierarchy (if available)
# ZIP codes -> first 3 digits (regions)
df['region'] = df['zip_code'].str[:3]
Strategy 2: Hashing
Map categories to fixed-size hash buckets:
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=256, input_type='string')
X_hashed = hasher.fit_transform(df['category'].apply(lambda x: [x]))
Pros: fixed dimensionality, handles unseen categories. Cons: collisions (different categories map to the same bucket), not reversible.
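A quick back-of-the-envelope check (a sketch with illustrative numbers) shows why the bucket count matters: when categories far outnumber buckets, most categories are forced to share.
# Expected collisions under uniform hashing (illustrative numbers)
n_buckets, n_categories = 256, 5000
# Expected number of distinct buckets actually used
expected_occupied = n_buckets * (1 - (1 - 1 / n_buckets) ** n_categories)
# Every category beyond the first occupant of a bucket is a collision
collisions = n_categories - expected_occupied
print(f"~{expected_occupied:.0f} buckets used, ~{collisions:.0f} colliding categories")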
Strategy 3: Combined Encodings
For the highest-cardinality features, it often pays to derive several complementary columns at once: a leakage-safe target encoding, a frequency count, a rare-category flag, and a frequency rank:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def encode_high_cardinality(df, cat_col, target_col, strategy='combined'):
    """
    Comprehensive encoding for high-cardinality categorical features.
    Creates multiple derived features capturing different aspects.
    """
    result = df.copy()

    if strategy == 'combined':
        # 1. Target encoding with K-fold regularization
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        result[f'{cat_col}_target'] = np.nan

        for train_idx, val_idx in kf.split(df):
            train_mean = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
            global_mean = df.iloc[train_idx][target_col].mean()

            # Smoothed encoding
            counts = df.iloc[train_idx].groupby(cat_col).size()
            smoothed = (counts * train_mean + 10 * global_mean) / (counts + 10)

            result.loc[df.index[val_idx], f'{cat_col}_target'] = df.iloc[val_idx][cat_col].map(smoothed)

        # Fill NaN with global mean
        result[f'{cat_col}_target'] = result[f'{cat_col}_target'].fillna(df[target_col].mean())

        # 2. Frequency encoding
        freq = df[cat_col].value_counts()
        result[f'{cat_col}_freq'] = df[cat_col].map(freq)
        result[f'{cat_col}_freq_log'] = np.log1p(result[f'{cat_col}_freq'])

        # 3. Rare indicator
        result[f'{cat_col}_is_rare'] = (result[f'{cat_col}_freq'] < 10).astype(int)

        # 4. Category rank (by frequency)
        rank = freq.rank(ascending=False, method='dense')
        result[f'{cat_col}_rank'] = df[cat_col].map(rank)

    return result

Choosing the right encoding depends on cardinality, interpretability needs, and available compute.
| Cardinality | Ordinal? | Recommended Approach |
|---|---|---|
| < 5 | No | One-hot encoding |
| < 5 | Yes | Label encoding with explicit order |
| 5-20 | No | One-hot or native categorical |
| 5-20 | Yes | Label encoding |
| 20-100 | No | Native categorical (LightGBM/CatBoost) |
| 100-1000 | No | Target + frequency encoding combination |
| 1000-10000 | No | Target encoding + hashing/grouping |
| > 10000 | No | Hashing or learned embeddings |
Many categorical features have natural hierarchies: products roll up into categories and departments, and ZIP codes roll up into regions and states.
Leveraging Hierarchies:
def create_hierarchical_features(df):
    '''Create features at multiple hierarchy levels.'''
    # Product hierarchy: get_department / get_category stand for lookup
    # mappings (dicts or functions) from product_id to its parent levels
    df['department'] = df['product_id'].map(get_department)
    df['category'] = df['product_id'].map(get_category)
    # Encode each level, from most general to most specific
    for level in ['department', 'category', 'product_id']:
        # Target encoding at each level (target_encode stands for a
        # leakage-safe encoder such as encode_high_cardinality above)
        df[f'{level}_target'] = target_encode(df[level])
        # Frequency at each level
        df[f'{level}_freq'] = df[level].map(df[level].value_counts())
    return df
This creates a cascade of features from general to specific, allowing the model to learn both broad patterns and specific details.
Higher-level categories provide regularization for rare specific categories. If a specific product is new, its category and department encodings still provide signal. This is similar to shrinkage toward a group mean in hierarchical models.
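A rough sketch of that shrinkage idea follows (hypothetical column names product_id, category, and target; m controls how strongly rare products are pulled toward their category mean; the fold-based leakage handling shown earlier is omitted for brevity):
import pandas as pd

def shrunk_product_mean(df: pd.DataFrame, m: float = 20.0) -> pd.Series:
    # Blend the product-level target mean with its category-level mean,
    # weighting by how many observations the product has
    cat_mean = df.groupby('category')['target'].transform('mean')
    prod_mean = df.groupby('product_id')['target'].transform('mean')
    prod_count = df.groupby('product_id')['target'].transform('count')
    return (prod_count * prod_mean + m * cat_mean) / (prod_count + m)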
You now have a comprehensive framework for categorical feature handling in gradient boosting. Next, we explore feature selection—identifying which features (including engineered categoricals) provide the most value for your boosting model.