Categorical features are ubiquitous in real-world data: product categories, user demographics, geographic regions, browser types, payment methods. Yet tree-based models, despite their power, have historically struggled with categorical data—forcing practitioners to resort to preprocessing transformations that often introduce their own problems.
CatBoost (short for "Categorical Boosting") was designed from the ground up to handle categorical features natively. Its approach goes far beyond convenience—it implements a mathematically principled encoding scheme that avoids target leakage while capturing complex relationships between categories and targets.
This capability alone makes CatBoost the preferred choice for many production ML systems where categorical features dominate.
The fundamental challenge is that standard tree algorithms split on feature values. For numerical features, this means finding optimal thresholds. For categorical features with many levels, the number of candidate splits explodes: a feature with 10,000 cities admits $2^{9999} - 1$ distinct binary partitions—an astronomically infeasible search space.
Before understanding CatBoost's innovation, we must understand why traditional approaches to categorical features fall short.
One-Hot Encoding
The most common approach is one-hot encoding: create a binary feature for each category level.
Advantages: conceptually simple, supported by every library, and imposes no artificial ordering on the categories.
Problems: the column count equals the cardinality, so memory and training time balloon; the resulting matrix is extremely sparse; statistical strength cannot be shared across related categories; and unseen categories have no column at all.
| Feature | Cardinality | One-Hot Columns | Typical Issues |
|---|---|---|---|
| Country | ~200 | 200 | Manageable but fragments geography |
| City | ~10,000 | 10,000 | Memory explosion; sparse trees |
| Product ID | ~1,000,000 | 1,000,000 | Completely infeasible |
| User ID | ~100,000,000 | N/A | Impossible; must aggregate |
| IP Address | ~4,000,000,000 | N/A | Must hash or bin; loses precision |
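As a minimal sketch of the column explosion (the city counts are illustrative, mirroring the table above):

```python
import pandas as pd

# Minimal sketch: one-hot column count equals categorical cardinality.
n_cities = 10_000
cities = pd.Series([f"city_{i % n_cities}" for i in range(50_000)], name="city")

one_hot = pd.get_dummies(cities, prefix="city", sparse=True)
print(one_hot.shape)  # (50000, 10000): one binary column per distinct city
```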
Label/Ordinal Encoding
Assign each category an integer: 0, 1, 2, ..., k-1.
Advantages: a single dense column regardless of cardinality, negligible memory, and trivial to compute.
Problems: it imposes an arbitrary ordering on nominal categories (see below), and unseen categories have no ID.
Label encoding works for ordinal categories (Small < Medium < Large) but creates severe problems for nominal categories. A tree split on 'City < 50' makes no semantic sense—it's dividing cities based on arbitrary IDs, not meaningful similarity.
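A minimal sketch of that problem (the city names are illustrative):

```python
import pandas as pd

cities = pd.Series(["tokyo", "lima", "paris", "tokyo", "osaka"])
codes, levels = pd.factorize(cities)
print(codes)   # [0 1 2 0 3] -- IDs reflect order of appearance, not similarity
print(levels)  # Index(['tokyo', 'lima', 'paris', 'osaka'], dtype='object')
# A split like "city_code < 2" separates {tokyo, lima} from {paris, osaka}
# purely by arbitrary ID.
```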
Hashing/Feature Hashing
Hash category strings to fixed-size integers, typically modulo some bucket count.
Advantages: fixed output size regardless of cardinality, no fitted vocabulary to store, and unseen categories map to a bucket without retraining.
Problems: hash collisions force unrelated categories to share a bucket, the encoding is uninterpretable, and the bucket count becomes another hyperparameter.
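A minimal sketch of the idea, using MD5 as an arbitrary stable hash (the bucket count and category names are illustrative):

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 1024) -> int:
    # Stable hash; Python's built-in hash() is salted per process
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(hash_bucket("city_42"))          # maps to some bucket in [0, 1024)
print(hash_bucket("brand_new_city"))   # unseen categories still map somewhere
# Distinct categories can collide in the same bucket,
# and the model cannot tell them apart afterwards.
```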
| Method | Columns Created | Preserves Info | Target Leakage | Unseen Handling |
|---|---|---|---|---|
| One-Hot | k | Perfect | None | Fails (new column needed) |
| Label Encoding | 1 | Distorted (false ordering) | None | Fails (new ID needed) |
| Hash Encoding | Fixed | Lossy (collisions) | None | Graceful (maps to bucket) |
| Target Encoding | 1 | Good (mean target) | HIGH RISK | Graceful (global mean) |
Target encoding (also called mean encoding or likelihood encoding) is conceptually elegant: replace each category with the mean target value for samples in that category.
For a categorical feature $c$ with level $c_j$, the target encoding is: $$TE(c_j) = \mathbb{E}[Y | C = c_j] \approx \frac{\sum_{i: c_i = c_j} y_i}{\sum_{i: c_i = c_j} 1}$$
This approach is powerful because it produces a single dense numeric column regardless of cardinality, it directly encodes the relationship between a category and the target, and categories with similar target behavior receive similar encodings.
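A minimal sketch of the formula above on toy data (the values are illustrative); note that this is the naive, leaky version discussed next:

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["paris", "paris", "tokyo", "tokyo", "tokyo", "lima"],
    "churned": [1,       0,       0,       0,       1,       1],
})

# Naive target encoding: per-category mean of the target, computed on all rows
te = df.groupby("city")["churned"].mean()
df["city_te"] = df["city"].map(te)
print(df)  # paris -> 0.50, tokyo -> 0.33..., lima -> 1.00
```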
The Fatal Flaw: Target Leakage
The problem is devastating: when you compute $TE(c_j)$ using all training samples in category $c_j$, you're using the target $y_i$ of sample $i$ to create a feature that will be used to predict $y_i$.
This is pure information leakage—the feature literally contains the answer.
Target leakage in encoding is especially insidious because it affects every categorical feature, not just the obviously problematic ones. A rare category with 5 samples will have an encoding that essentially memorizes those 5 targets—giving the model perfect knowledge of their outcomes from a 'feature' that has no predictive power on new data.
Case Study: Target Leakage in Practice
Consider predicting customer churn with a 'product_id' feature, where Product A appears for only 5 training customers.
The naive target encoding makes those 5 customers appear highly predictable (their encodings directly encode their own outcomes), but it provides no true predictive power for new Product A customers.
Mitigation Attempts
Practitioners have tried various fixes:
Leave-one-out encoding: Compute mean excluding the current sample
Cross-validation encoding: Compute means from other CV folds
Additive smoothing: Blend with global mean: $\frac{n \cdot \hat{\mu}_c + \alpha \cdot \mu_{\text{global}}}{n + \alpha}$
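As a hypothetical worked example of additive smoothing (the numbers are illustrative, not from any dataset): a category with $n = 5$ samples whose mean target is $\hat{\mu}_c = 1.0$, blended toward a global mean $\mu_{\text{global}} = 0.3$ with $\alpha = 10$, yields

$$\frac{5 \cdot 1.0 + 10 \cdot 0.3}{5 + 10} = \frac{8}{15} \approx 0.53,$$

pulling the rare category most of the way back toward the global mean. The demonstration below shows why such fixes matter: naive encoding makes a purely random categorical feature look predictive.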
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import GradientBoostingClassifier


def demonstrate_target_leakage():
    """
    Shows how naive target encoding causes dramatic overfitting.

    We create a dataset where the categorical feature has NO predictive
    power, yet target encoding makes it appear highly predictive.
    """
    np.random.seed(42)
    n_samples = 1000
    n_categories = 200

    # Random categorical feature with NO relationship to target
    categories = np.random.randint(0, n_categories, n_samples)

    # Target is pure noise - category provides zero information
    y = np.random.randint(0, 2, n_samples)

    # === Method 1: Naive Target Encoding ===
    # Compute mean target per category across ALL data (LEAKAGE!)
    category_means = pd.Series(y).groupby(categories).transform('mean')
    X_leaky = category_means.values.reshape(-1, 1)

    # This will appear to have signal, but it's pure leakage
    leaky_cv_scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        X_leaky, y, cv=5, scoring='roc_auc'
    )

    # === Method 2: Proper Leave-One-Out within CV ===
    # More proper approach: encode each fold using only training fold data
    X_proper = np.zeros((n_samples, 1))

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(X_leaky):
        # Compute means only from training fold
        train_categories = categories[train_idx]
        train_y = y[train_idx]

        means = {}
        for cat in np.unique(train_categories):
            mask = train_categories == cat
            means[cat] = train_y[mask].mean() if mask.sum() > 0 else 0.5

        # Apply to validation fold
        global_mean = train_y.mean()
        for idx in val_idx:
            X_proper[idx, 0] = means.get(categories[idx], global_mean)

    proper_cv_scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        X_proper, y, cv=5, scoring='roc_auc'
    )

    print("Target Leakage Demonstration")
    print("=" * 50)
    print("Ground truth: Categories have ZERO predictive power")
    print("Expected AUC: 0.50 (random chance)")
    print()
    print(f"Naive target encoding AUC: {leaky_cv_scores.mean():.3f}")
    print("  ^ Appears predictive due to leakage!")
    print()
    print(f"Proper fold-isolated encoding AUC: {proper_cv_scores.mean():.3f}")
    print("  ^ Correctly shows no predictive power")


demonstrate_target_leakage()
```

CatBoost solves target leakage in categorical encoding using the same ordered-processing concept from ordered boosting. The result is ordered target statistics—a principled target encoding that never uses a sample's target to encode that sample's features.
The Ordered Target Statistics Algorithm
For a random permutation $\sigma$ of training samples and categorical feature $c$:
$$TS_i = \frac{\sum_{j: \sigma(j) < \sigma(i), c_j = c_i} y_j + a \cdot P}{\sum_{j: \sigma(j) < \sigma(i), c_j = c_i} 1 + a}$$
Where $\sigma$ is a random permutation of the training samples, the sums run over samples of the same category that appear before sample $i$ in that permutation, $P$ is a prior value (typically the mean target), and $a > 0$ is the prior weight controlling the strength of shrinkage toward $P$.
Breaking Down the Formula
For sample $i$ with category "Paris": the numerator adds up the targets of all "Paris" samples that precede $i$ in the permutation, plus the prior term $a \cdot P$; the denominator counts those same preceding samples, plus the prior weight $a$.
Critically: sample $i$'s own target $y_i$ is never used in computing $TS_i$.
Because we only use samples appearing before position $\sigma(i)$, and sample $i$'s target $y_i$ doesn't influence those samples' encoding, there's no circular dependency. The encoding for sample $i$ is computed from data that had no knowledge of $y_i$.
Visual Walkthrough
Consider samples in order after permutation, using prior $P = 0.5$ and prior weight $a = 0.5$:
| Position | Sample | Category | Target | Samples Before (same cat) | Target Stats |
|---|---|---|---|---|---|
| 1 | A | Red | 1 | (none) | Prior = 0.5 |
| 2 | B | Blue | 0 | (none) | Prior = 0.5 |
| 3 | C | Red | 0 | A (Red, y=1) | (1 + 0.5·0.5)/(1+0.5) = 0.83 |
| 4 | D | Red | 1 | A,C (Red) | (1+0 + 0.5·0.5)/(2+0.5) = 0.5 |
| 5 | E | Blue | 1 | B (Blue, y=0) | (0 + 0.5·0.5)/(1+0.5) = 0.17 |
Notice: the first occurrence of each category falls back to the prior (no history yet); D's encoding is computed from A and C only, never from D's own target; and the same category can receive different encodings at different positions in the permutation. The short sketch below reproduces these values.
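A few lines of Python reproduce the walkthrough table above (prior $P = 0.5$, prior weight $a = 0.5$):

```python
# Reproduces the walkthrough: ordered target statistics with P=0.5, a=0.5
samples = [("A", "Red", 1), ("B", "Blue", 0), ("C", "Red", 0),
           ("D", "Red", 1), ("E", "Blue", 1)]
P, a = 0.5, 0.5
seen_sum, seen_cnt = {}, {}

for name, cat, y in samples:
    s, c = seen_sum.get(cat, 0.0), seen_cnt.get(cat, 0)
    ts = (s + a * P) / (c + a)  # uses only earlier samples of this category
    print(f"{name} ({cat}): TS = {ts:.2f}")
    seen_sum[cat], seen_cnt[cat] = s + y, c + 1   # now include this sample
```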
```python
import numpy as np
from typing import Dict


def compute_ordered_target_statistics(
    categories: np.ndarray,
    targets: np.ndarray,
    prior: float = 0.5,
    prior_weight: float = 1.0,
    random_seed: int = 42,
) -> np.ndarray:
    """
    Compute ordered target statistics as used by CatBoost.

    For each sample, the encoding uses only targets from samples appearing
    earlier in a random permutation that share the same category.

    Parameters:
    -----------
    categories : array of shape (n_samples,)
        Categorical feature values
    targets : array of shape (n_samples,)
        Target values (0/1 for classification, continuous for regression)
    prior : float
        Prior probability/value for regularization
    prior_weight : float
        Weight of prior (higher = more shrinkage toward prior)
    random_seed : int
        Random seed for permutation reproducibility

    Returns:
    --------
    encodings : array of shape (n_samples,)
        Ordered target statistics for each sample
    """
    np.random.seed(random_seed)
    n_samples = len(categories)

    # Generate random permutation
    permutation = np.random.permutation(n_samples)

    # Track running statistics per category
    # sum_targets[cat] = sum of targets seen so far for category cat
    # count[cat] = number of samples seen so far for category cat
    sum_targets: Dict[int, float] = {}
    count: Dict[int, int] = {}

    # Encodings for each sample (in original order)
    encodings = np.zeros(n_samples)

    # Process samples in permutation order
    for position in range(n_samples):
        sample_idx = permutation[position]
        cat = categories[sample_idx]
        target = targets[sample_idx]

        # Compute encoding using statistics BEFORE this sample
        cat_sum = sum_targets.get(cat, 0.0)
        cat_count = count.get(cat, 0)

        # Ordered target statistic with prior smoothing
        encodings[sample_idx] = (
            (cat_sum + prior_weight * prior) / (cat_count + prior_weight)
        )

        # Update running statistics to include this sample
        # (for subsequent samples in the permutation)
        sum_targets[cat] = cat_sum + target
        count[cat] = cat_count + 1

    return encodings


def verify_no_leakage(categories: np.ndarray, targets: np.ndarray):
    """
    Verify that ordered target statistics don't cause leakage.

    For a random target with no relationship to categories,
    ordered encoding should show no predictive power.
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import GradientBoostingClassifier

    # Compute ordered target statistics
    encodings = compute_ordered_target_statistics(
        categories, targets, prior=0.5, prior_weight=1.0
    )
    X = encodings.reshape(-1, 1)

    # Cross-validate
    scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=42),
        X, targets, cv=5, scoring='roc_auc'
    )

    print(f"Ordered Target Statistics AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
    print("Expected: ~0.50 for random targets (no leakage)")

    return scores.mean()


# Demonstrate with random data
if __name__ == "__main__":
    np.random.seed(42)
    n_samples = 1000
    categories = np.random.randint(0, 100, n_samples)
    targets = np.random.randint(0, 2, n_samples)  # Random targets

    print("Verification: Ordered encoding eliminates leakage")
    print("=" * 55)
    verify_no_leakage(categories, targets)
```

CatBoost's ordered target statistics shine in high-cardinality scenarios where other methods fail. Consider handling user IDs, product SKUs, or IP addresses—features with thousands to millions of unique values.
The Prior Smoothing Mechanism
The prior weight parameter $a$ in the target statistics formula isn't just regularization—it's essential for handling rare categories:
$$TS_i = \frac{\sum_{j: \sigma(j) < \sigma(i), c_j = c_i} y_j + a \cdot P}{\sum_{j: \sigma(j) < \sigma(i), c_j = c_i} 1 + a}$$
For a category seen only once before the current sample, the encoding reduces to $TS = \frac{y_{\text{prev}} + a \cdot P}{1 + a}$, where $y_{\text{prev}}$ is that single preceding target.
If $a = 1$, the encoding is halfway between the single observed target and the prior. This prevents extreme encodings based on minimal data.
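As a worked instance of that statement, a single preceding observation with $y = 1$, prior $P = 0.5$, and $a = 1$ gives

$$TS = \frac{1 + 1 \cdot 0.5}{1 + 1} = 0.75,$$

exactly halfway between the observed target (1.0) and the prior (0.5).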
Behavior Analysis by Category Frequency (prior $P = 0.5$, prior weight $a = 1$)
| Category Size | Preceding Samples | Mean Target | Computed Encoding | Shrinkage Effect |
|---|---|---|---|---|
| 0 (unseen) | None | N/A | 0.50 | 100% prior (no data) |
| 1 | 1 sample, y=1 | 1.0 | 0.75 | 50% shrinkage toward prior |
| 5 | All y=1 | 1.0 | 0.92 | ~17% shrinkage |
| 50 | Mean 0.8 | 0.8 | 0.79 | 2% shrinkage |
| 500 | Mean 0.8 | 0.8 | 0.80 | 0.2% shrinkage (data dominates) |
New Categories at Inference Time
A major advantage of target statistics: unseen categories gracefully default to the prior rather than failing. With no preceding samples of the category, the formula reduces to $TS = \frac{a \cdot P}{a} = P$.
This contrasts sharply with one-hot encoding, which cannot represent unseen categories without retraining.
Multi-Level Combinations
CatBoost automatically generates combinations of categorical features when beneficial: as a tree is grown, categorical features (and combinations) already used in earlier splits are greedily combined with the remaining categorical features, and each combination is treated as a new categorical feature with its own target statistics.
For example, with "Country" and "Browser", the pair becomes a derived categorical feature whose levels are combinations such as (Japan, Chrome). This captures interactions like "Chrome users from Japan behave differently than Chrome users from Brazil" without explicit feature engineering; a standalone sketch of the idea follows below.
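A rough, self-contained sketch of what such a combination amounts to (the column names, toy data, and use of a single permutation are illustrative; CatBoost builds these combinations internally):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 12
df = pd.DataFrame({
    "country": rng.choice(["JP", "BR"], n),
    "browser": rng.choice(["chrome", "firefox"], n),
    "target":  rng.integers(0, 2, n),
})

# Treat the pair as one derived categorical feature, e.g. "JP|chrome"
df["country_x_browser"] = df["country"] + "|" + df["browser"]

# Encode the pair with ordered target statistics (prior P=0.5, weight a=1.0)
P, a = 0.5, 1.0
perm = rng.permutation(n)
running_sum, running_cnt = {}, {}
encoding = np.zeros(n)
for pos in perm:
    key = df.loc[pos, "country_x_browser"]
    s, c = running_sum.get(key, 0.0), running_cnt.get(key, 0)
    encoding[pos] = (s + a * P) / (c + a)          # earlier rows only
    running_sum[key] = s + df.loc[pos, "target"]   # then include this row
    running_cnt[key] = c + 1

df["ctr_country_x_browser"] = encoding
print(df)
```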
```python
from catboost import CatBoostClassifier, Pool
import pandas as pd
import numpy as np


# Create sample data with high-cardinality categoricals
def create_sample_data():
    np.random.seed(42)
    n_samples = 10000

    data = {
        'user_id': [f'user_{i % 5000}' for i in range(n_samples)],
        'product_category': np.random.choice(
            ['electronics', 'clothing', 'food', 'books', 'sports'],
            n_samples
        ),
        'city': np.random.choice(
            [f'city_{i}' for i in range(200)],  # 200 cities
            n_samples
        ),
        'browser': np.random.choice(
            ['chrome', 'firefox', 'safari', 'edge', 'other'],
            n_samples
        ),
        'numeric_feature': np.random.randn(n_samples),
        'target': (np.random.rand(n_samples) > 0.7).astype(int)
    }

    return pd.DataFrame(data)


df = create_sample_data()
train_df = df.iloc[:8000]
test_df = df.iloc[8000:]

# Identify categorical columns
cat_features = ['user_id', 'product_category', 'city', 'browser']

# CatBoost handles categoricals natively - no preprocessing needed
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,

    # ===========================
    # Categorical Feature Config
    # ===========================

    # Identify categorical feature indices or names
    cat_features=cat_features,

    # One-hot encoding threshold: categories with <= this many levels
    # are one-hot encoded; others use target statistics
    one_hot_max_size=10,  # One-hot for features with ≤10 categories

    # Maximum number of categorical feature combinations to consider
    max_ctr_complexity=2,  # Up to pairs of categoricals

    # Prior for target statistics regularization
    # Higher = more shrinkage toward prior
    simple_ctr=[
        'Borders:Prior=0.5:PriorWeight=1.0',
    ],

    random_seed=42,
    verbose=100,
)

# Create Pool objects (CatBoost's data container)
train_pool = Pool(
    train_df.drop('target', axis=1),
    train_df['target'],
    cat_features=cat_features
)

test_pool = Pool(
    test_df.drop('target', axis=1),
    test_df['target'],
    cat_features=cat_features
)

# Train - CatBoost automatically:
# 1. Computes ordered target statistics per permutation
# 2. Generates categorical feature combinations
# 3. Uses symmetric trees for efficient processing
model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=50)

# Examine feature importance including categorical combinations
feature_importance = model.get_feature_importance(
    train_pool,
    type='PredictionValuesChange'
)

feature_names = model.feature_names_
print("\nTop Features by Importance:")
for name, importance in sorted(
    zip(feature_names, feature_importance),
    key=lambda x: -x[1]
)[:10]:
    print(f"  {name}: {importance:.3f}")

# Handle new categories at inference
# Create a sample with previously unseen categories
new_sample = pd.DataFrame({
    'user_id': ['completely_new_user'],    # Never seen in training
    'product_category': ['new_category'],  # Never seen
    'city': ['city_999'],                  # Never seen
    'browser': ['chrome'],                 # Seen
    'numeric_feature': [0.5]
})

# CatBoost handles this gracefully - no crash, uses priors
prediction = model.predict_proba(new_sample)
print(f"\nPrediction for unseen categories: {prediction}")
print("(Graceful fallback to prior - no errors)")
```

CatBoost provides extensive control over categorical feature processing through CTR (Click-Through Rate) configurations. The name comes from CatBoost's origins in Yandex's advertising systems, but the mechanisms apply to any classification or regression task.
CTR Types
CatBoost offers multiple target statistic types, including Borders and Buckets (target statistics computed against binarized target borders or buckets), BinarizedTargetMeanValue (a mean of the binarized target), and Counter (a frequency-based statistic that ignores the target).
CTR Configuration Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| Prior | [0, 1] | Mean target | Center for shrinkage on small categories |
| PriorWeight | (0, ∞) | 1.0 | Strength of shrinkage toward prior |
| MaxCtrComplexity | 1-8 | 4 | Max features in categorical combination |
| TargetBorderCount | 1-255 | 1 | Borders for binarizing regression targets |
| CtrBorderCount | 1-255 | 15 | Borders for discretizing CTR values |
Feature-Specific CTR Settings
For datasets with heterogeneous categorical features, you can specify different CTR settings per feature:
```python
model = CatBoostClassifier(
    per_feature_ctr={
        0: ['Borders:TargetBorderCount=1:PriorWeight=0.5'],  # Feature 0
        1: ['Counter'],                      # Feature 1: use counts instead of means
        2: ['Borders:TargetBorderCount=5'],  # Feature 2: more granular
    }
)
```
Combining CTR Types
Multiple CTR types can be combined for the same feature, generating multiple derived features:
```python
simple_ctr=[
    'Borders:TargetBorderCount=1',
    'Counter',
]
```
This creates two encodings per categorical feature: one target-mean based, one count-based. The model can learn to use whichever is more predictive.
For most applications, default CTR settings work well. Consider tuning only when: (1) you have domain knowledge about specific features, (2) extreme category imbalance exists, or (3) you're optimizing for maximum model performance and have compute for hyperparameter search.
```python
from catboost import CatBoostClassifier, Pool
import numpy as np
import pandas as pd


# Example: Configuring CTR for e-commerce prediction
def configure_advanced_ctr():
    """
    Demonstrates advanced CTR configuration for a realistic
    e-commerce purchase prediction problem.
    """
    model = CatBoostClassifier(
        iterations=1000,
        learning_rate=0.03,
        depth=6,

        # ===========================================
        # General CTR Configuration
        # ===========================================

        # Max combinations of categorical features
        # Higher = more expressive but slower and risk of overfitting
        max_ctr_complexity=3,  # Up to triples

        # One-hot threshold: use one-hot for low-cardinality
        one_hot_max_size=5,  # One-hot if ≤5 categories

        # ===========================================
        # Simple CTR: Applied to individual features
        # ===========================================
        simple_ctr=[
            # Primary: Target mean with moderate shrinkage
            'Borders:TargetBorderCount=1:PriorWeight=1.0',
            # Secondary: Frequency-based (useful for popularity signals)
            'Counter',
        ],

        # ===========================================
        # Combination CTR: Applied to feature combinations
        # ===========================================
        combinations_ctr=[
            # For combinations, use stronger shrinkage
            # (combinations have less data per category)
            'Borders:TargetBorderCount=1:PriorWeight=2.0',
        ],

        # ===========================================
        # Per-Feature CTR Customization
        # ===========================================
        # Assuming: 0=user_id, 1=category, 2=brand, 3=device
        per_feature_ctr={
            # User ID: Very high cardinality, aggressive shrinkage
            0: ['Borders:PriorWeight=5.0'],
            # Category: Moderate cardinality, standard config
            1: ['Borders:PriorWeight=1.0', 'Counter'],
            # Brand: May have popularity effects
            2: ['Borders:PriorWeight=1.0', 'Counter'],
            # Device: Low cardinality, will be one-hot encoded anyway
            3: [],  # Use one-hot only
        },

        # ===========================================
        # Regularization for overfitting prevention
        # ===========================================
        l2_leaf_reg=3.0,
        random_strength=0.5,
        bagging_temperature=0.3,

        random_seed=42,
        verbose=100,
    )

    return model


def analyze_ctr_features(model, pool):
    """
    Analyze how CatBoost created CTR-based features.
    """
    # Get all feature importances including CTR combinations
    importance = model.get_feature_importance(pool)
    feature_names = model.feature_names_

    # Categorize features
    original_features = []
    ctr_features = []
    combination_features = []

    for name, imp in zip(feature_names, importance):
        if '{' in name:
            # CTR combination features contain braces
            combination_features.append((name, imp))
        elif ':' in name:
            # CTR transforms contain colons
            ctr_features.append((name, imp))
        else:
            original_features.append((name, imp))

    print("Feature Importance Analysis")
    print("=" * 60)

    print("\nOriginal Features:")
    for name, imp in sorted(original_features, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    print("\nCTR-transformed Features:")
    for name, imp in sorted(ctr_features, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    print("\nCombination Features:")
    for name, imp in sorted(combination_features, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")


# Usage example (with appropriate data)
model = configure_advanced_ctr()
print("Model configured with advanced CTR settings")
print(f"Simple CTR: {model.get_params().get('simple_ctr')}")
print(f"Combinations CTR: {model.get_params().get('combinations_ctr')}")
```

Understanding how CatBoost's categorical handling compares to XGBoost and LightGBM helps practitioners choose the right tool for their data.
XGBoost's Approach
Until recently, XGBoost had no native categorical support—all preprocessing was the user's responsibility. XGBoost 1.6+ added experimental categorical support:
Mechanism: Optimal partitioning of categories based on gradient statistics
Pros: No target leakage by construction (uses gradients, not raw targets)
Cons: Still experimental; requires columns to be cast to a categorical dtype; struggles with very high cardinality; no automatic feature combinations; unseen categories at inference are problematic
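A minimal, hedged sketch of XGBoost's native path (assumes XGBoost ≥ 1.6; the toy data and parameter values are illustrative):

```python
import pandas as pd
from xgboost import XGBClassifier

# Categorical columns must use pandas "category" dtype for native handling
X = pd.DataFrame({
    "city":   pd.Series(["paris", "tokyo", "paris", "lima"] * 25, dtype="category"),
    "amount": list(range(100)),
})
y = [1, 0, 1, 0] * 25

model = XGBClassifier(
    tree_method="hist",       # native categorical support needs hist-based growth
    enable_categorical=True,  # opt in to the experimental categorical handling
    n_estimators=20,
)
model.fit(X, y)
print(model.predict_proba(X.head(2)))
```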
LightGBM's Approach
LightGBM introduced native categorical support before XGBoost:
Mechanism: Categories are sorted by their gradient statistics and the sorted order is then split like a numerical feature (Fisher-style optimal grouping)
Pros: Fast, with minimal training overhead; no manual preprocessing once columns are marked categorical; effective for low-to-moderate cardinality
Cons: Prone to overfitting on high-cardinality features (controlled via parameters such as min_data_per_group and cat_smooth); target leakage is not addressed at the encoding level; no automatic feature combinations; unseen categories at inference are problematic
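And a comparable hedged sketch for LightGBM (toy data and parameter values are illustrative):

```python
import pandas as pd
import lightgbm as lgb

# LightGBM detects pandas "category" dtype automatically; columns can also be
# declared explicitly via the categorical_feature argument
X = pd.DataFrame({
    "city":   pd.Series(["paris", "tokyo", "paris", "lima"] * 25, dtype="category"),
    "amount": list(range(100)),
})
y = [1, 0, 1, 0] * 25

model = lgb.LGBMClassifier(
    n_estimators=20,
    cat_smooth=10.0,        # smoothing for categorical splits (overfitting control)
    min_data_per_group=5,   # minimum samples per category group
)
model.fit(X, y)  # "city" is picked up as categorical via its dtype
print(model.predict_proba(X.head(2)))
```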
| Aspect | CatBoost | LightGBM | XGBoost |
|---|---|---|---|
| Native support | Core feature since v1 | Added 2017 | Experimental since v1.6 |
| Encoding method | Ordered target statistics | Gradient-based splits | Optimal partitioning |
| Target leakage | Eliminated by design | Not directly addressed | Avoided (uses gradients) |
| High cardinality (>1K) | Excellent | Moderate | Limited |
| Feature combinations | Automatic generation | Manual required | Manual required |
| New categories | Graceful (prior) | Problematic | Problematic |
| Configuration | Extensive CTR options | Basic on/off | Limited |
| Speed impact | Adds overhead for ordering | Minimal | Moderate |
Choose CatBoost when: (1) categorical features dominate your data, (2) cardinality is high (>100 categories), (3) you want automatic feature combinations, (4) you cannot afford target leakage, or (5) you expect new categories at inference time. Choose competitors when speed is paramount and your categorical preprocessing is already robust.
Benchmark: Categorical Performance
On datasets with significant categorical content, CatBoost typically matches or outperforms competitors that rely on manual encoding, with the advantage growing as categorical cardinality rises, though training is usually slower.
The training time penalty is often acceptable given the encoding pipelines it eliminates, the target leakage it rules out by design, and the graceful handling of categories first seen at inference time.
Real-World Impact
In production systems at companies like Yandex, Cloudflare, and others, CatBoost's categorical handling has eliminated entire preprocessing pipelines—reducing code complexity, deployment risk, and engineering maintenance burden.
CatBoost's categorical feature handling represents a significant advancement in practical machine learning. By combining ordered processing with sophisticated target statistics, it eliminates target leakage while providing robust handling for high-cardinality features.
Next, we'll dive deeper into target statistics—exploring the mathematical foundations of different CTR types and how to optimize them for specific prediction tasks.