Categorical features are ubiquitous in real-world datasets: product categories, geographic regions, user segments, device types, and countless domain-specific classifications. Yet machine learning algorithms fundamentally operate on numerical inputs, creating a representation gap that practitioners must bridge.
For gradient boosting specifically, categorical handling is both critically important and surprisingly nuanced. Different boosting libraries implement different categorical handling strategies, each with distinct tradeoffs. Choosing the wrong approach can dramatically hurt model performance or training efficiency.
By the end of this page, you will understand the full landscape of categorical encoding methods, how XGBoost, LightGBM, and CatBoost each handle categoricals natively, when to use which encoding strategy, and how to handle high-cardinality and hierarchical categoricals effectively.
| Method | Cardinality | Pros | Cons |
|---|---|---|---|
| One-Hot | Low (<20) | No information loss, interpretable | Dimensionality explosion, sparse |
| Label/Ordinal | Any | Single column, efficient | Imposes false ordering |
| Target Encoding | High | Captures predictive signal | Requires regularization (leakage risk) |
| Frequency Encoding | High | No leakage, captures popularity | No target signal |
| Binary Encoding | Medium | Compact (log₂k columns) | Loses some information |
| Hash Encoding | Very High | Fixed dimensionality | Collisions, not invertible |
| Embedding | Very High | Learned representations | Requires neural network training |
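To make a couple of the rows above concrete, here is a minimal sketch of frequency encoding in plain pandas and binary encoding via the third-party category_encoders package (the toy DataFrame and column name are illustrative):
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'SF', 'NYC', 'LA']})

# Frequency encoding: replace each category with how often it appears
df['city_freq'] = df['city'].map(df['city'].value_counts())

# Binary encoding (requires the category_encoders package):
# each category id is written in binary across roughly log2(k) columns
import category_encoders as ce
df_binary = ce.BinaryEncoder(cols=['city']).fit_transform(df[['city']])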
Ordinal vs. Nominal Categoricals:
Not all categorical features are the same: ordinal categories have a natural order (such as education level), while nominal categories do not (such as city or device type).
For ordinal variables, label encoding respects the underlying order:
education_order = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
df['education_encoded'] = df['education'].map(education_order)
For nominal variables, label encoding imposes a false ordering that can mislead the model. One-hot or specialized encodings are preferred.
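As a minimal sketch (with a hypothetical 'device' column), one-hot encoding in pandas looks like this:
import pandas as pd

df = pd.DataFrame({'device': ['mobile', 'desktop', 'tablet', 'mobile']})

# One-hot: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(df['device'], prefix='device')
df = pd.concat([df.drop(columns='device'), one_hot], axis=1)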
Modern boosting libraries provide native categorical handling that often outperforms manual encoding.
LightGBM Native Categoricals:
LightGBM searches for the best partition of categories into two groups without enumerating the exponentially many possible partitions. Instead it uses a sorting-based algorithm: categories are sorted by their accumulated gradient statistics (sum of gradients over sum of Hessians), and the best split is found by scanning split points along that sorted order:
import lightgbm as lgb
# Specify categorical features by name when fitting (sklearn API);
# columns with pandas 'category' dtype are also detected automatically
model = lgb.LGBMClassifier()
model.fit(
    X_train, y_train,
    categorical_feature=['city', 'product_type', 'device']
)
# Or using the Dataset API
train_data = lgb.Dataset(
    data=X_train, label=y_train,
    categorical_feature=['city', 'product_type']
)
CatBoost Native Categoricals:
CatBoost uses ordered target statistics (covered in target encoding page) computed dynamically during training:
import catboost as cb
model = cb.CatBoostClassifier(
cat_features=['city', 'product_type', 'device'],
one_hot_max_size=10 # One-hot for low cardinality
)
# CatBoost handles encoding automatically
model.fit(X_train, y_train)
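A toy sketch of the ordered-target-statistic idea follows (an illustration only; CatBoost averages over several random permutations and applies this inside its training loop). Each row's encoding uses only the rows that precede it in a random permutation, smoothed toward the global prior:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'NYC', 'LA', 'SF'],
    'y':    [1,     0,    1,     0,     1,    0],
})
prior, a = df['y'].mean(), 1.0  # smoothing toward the global prior

perm = np.random.default_rng(42).permutation(len(df))
encoded = pd.Series(index=df.index, dtype=float)
seen_sum, seen_cnt = {}, {}
for idx in perm:
    cat = df.loc[idx, 'city']
    encoded.loc[idx] = (seen_sum.get(cat, 0.0) + a * prior) / (seen_cnt.get(cat, 0) + a)
    seen_sum[cat] = seen_sum.get(cat, 0.0) + df.loc[idx, 'y']
    seen_cnt[cat] = seen_cnt.get(cat, 0) + 1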
XGBoost Native Categoricals (v1.5+):
XGBoost added experimental categorical support in version 1.5, with optimal partitioning of categories added in later releases; it requires a histogram-based tree method:
import xgboost as xgb
# Enable categorical feature support
model = xgb.XGBClassifier(
tree_method='hist',
enable_categorical=True
)
# Ensure columns are pandas Categorical dtype
X_train['city'] = X_train['city'].astype('category')
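Continuing the snippet above (X_train, y_train, and model are the objects already defined), fitting then works directly on the DataFrame, and the lower-level DMatrix interface accepts the same flag:
# Fit on the DataFrame once its categorical columns use the 'category' dtype
model.fit(X_train, y_train)

# The native (non-sklearn) interface takes the same flag on the DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)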
Use native handling when the number of categories is roughly between 10 and 1,000, you want the library to search for optimal splits, and you are using LightGBM or CatBoost. Use manual encoding when cardinality exceeds about 1,000 (hash or target encoding), you need reproducible features across different libraries, or you are injecting domain knowledge through your encoding choices.
High-cardinality features (1000+ categories) require special handling. One-hot encoding is infeasible, and even native categorical support may struggle.
Strategy 1: Grouping/Binning
Reduce cardinality by grouping rare or similar categories:
# Group by frequency threshold
value_counts = df['category'].value_counts()
top_categories = value_counts[value_counts >= 100].index
df['category_grouped'] = df['category'].where(
df['category'].isin(top_categories), 'OTHER'
)
# Group by hierarchy (if available)
# ZIP codes -> first 3 digits (regions)
df['region'] = df['zip_code'].str[:3]
Strategy 2: Hashing
Map categories to fixed-size hash buckets:
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=256, input_type='string')
X_hashed = hasher.fit_transform(df['category'].apply(lambda x: [x]))
Pros: fixed dimensionality, handles unseen categories. Cons: collisions (different categories map to the same bucket), not reversible.
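A quick back-of-the-envelope check (a sketch with illustrative numbers) shows why the bucket count matters: when categories far outnumber buckets, most categories are forced to share.
# Expected collisions under uniform hashing (illustrative numbers)
n_buckets, n_categories = 256, 5000
# Expected number of distinct buckets actually used
expected_occupied = n_buckets * (1 - (1 - 1 / n_buckets) ** n_categories)
# Every category beyond the first occupant of a bucket is a collision
collisions = n_categories - expected_occupied
print(f"~{expected_occupied:.0f} buckets used, ~{collisions:.0f} colliding categories")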
Strategy 3: Combined Encodings
For the highest-cardinality features, it often pays to derive several complementary columns at once: a leakage-safe target encoding, a frequency count, a rare-category flag, and a frequency rank:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def encode_high_cardinality(df, cat_col, target_col, strategy='combined'):
    """
    Comprehensive encoding for high-cardinality categorical features.
    Creates multiple derived features capturing different aspects.
    """
    result = df.copy()

    if strategy == 'combined':
        # 1. Target encoding with K-fold regularization
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        result[f'{cat_col}_target'] = np.nan

        for train_idx, val_idx in kf.split(df):
            train_mean = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
            global_mean = df.iloc[train_idx][target_col].mean()

            # Smoothed encoding
            counts = df.iloc[train_idx].groupby(cat_col).size()
            smoothed = (counts * train_mean + 10 * global_mean) / (counts + 10)

            result.loc[df.index[val_idx], f'{cat_col}_target'] = df.iloc[val_idx][cat_col].map(smoothed)

        # Fill NaN with global mean
        result[f'{cat_col}_target'] = result[f'{cat_col}_target'].fillna(df[target_col].mean())

        # 2. Frequency encoding
        freq = df[cat_col].value_counts()
        result[f'{cat_col}_freq'] = df[cat_col].map(freq)
        result[f'{cat_col}_freq_log'] = np.log1p(result[f'{cat_col}_freq'])

        # 3. Rare indicator
        result[f'{cat_col}_is_rare'] = (result[f'{cat_col}_freq'] < 10).astype(int)

        # 4. Category rank (by frequency)
        rank = freq.rank(ascending=False, method='dense')
        result[f'{cat_col}_rank'] = df[cat_col].map(rank)

    return result

Choosing the right encoding depends on cardinality, interpretability needs, and available compute.
| Cardinality | Ordinal? | Recommended Approach |
|---|---|---|
| < 5 | No | One-hot encoding |
| < 5 | Yes | Label encoding with explicit order |
| 5-20 | No | One-hot or native categorical |
| 5-20 | Yes | Label encoding |
| 20-100 | No | Native categorical (LightGBM/CatBoost) |
| 100-1000 | No | Target + frequency encoding combination |
| 1000-10000 | No | Target encoding + hashing/grouping |
| > 10000 | No | Hashing or learned embeddings |
Many categorical features have natural hierarchies: products roll up into categories and departments, and ZIP codes roll up into regions and states.
Leveraging Hierarchies:
def create_hierarchical_features(df):
    '''Create features at multiple hierarchy levels.'''
    # Product hierarchy: get_department / get_category stand for lookup
    # mappings (dicts or functions) from product_id to its parent levels
    df['department'] = df['product_id'].map(get_department)
    df['category'] = df['product_id'].map(get_category)
    # Encode each level, from most general to most specific
    for level in ['department', 'category', 'product_id']:
        # Target encoding at each level (target_encode stands for a
        # leakage-safe encoder such as encode_high_cardinality above)
        df[f'{level}_target'] = target_encode(df[level])
        # Frequency at each level
        df[f'{level}_freq'] = df[level].map(df[level].value_counts())
    return df
This creates a cascade of features from general to specific, allowing the model to learn both broad patterns and specific details.
Higher-level categories provide regularization for rare specific categories. If a specific product is new, its category and department encodings still provide signal. This is similar to shrinkage toward a group mean in hierarchical models.
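A rough sketch of that shrinkage idea follows (hypothetical column names product_id, category, and target; m controls how strongly rare products are pulled toward their category mean; the fold-based leakage handling shown earlier is omitted for brevity):
import pandas as pd

def shrunk_product_mean(df: pd.DataFrame, m: float = 20.0) -> pd.Series:
    # Blend the product-level target mean with its category-level mean,
    # weighting by how many observations the product has
    cat_mean = df.groupby('category')['target'].transform('mean')
    prod_mean = df.groupby('product_id')['target'].transform('mean')
    prod_count = df.groupby('product_id')['target'].transform('count')
    return (prod_count * prod_mean + m * cat_mean) / (prod_count + m)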
You now have a comprehensive framework for categorical feature handling in gradient boosting. Next, we explore feature selection—identifying which features (including engineered categoricals) provide the most value for your boosting model.