Target statistics (often called CTR from "Click-Through Rate" in CatBoost's advertising origins) are the mathematical mechanisms that transform categorical features into numerical representations suitable for gradient boosting. While we introduced the basic concept in the previous section, this page dives deep into the mathematical foundations, variants, and optimization strategies.
Understanding target statistics at this depth lets practitioners reason about rare-category behavior, choose among the available CTR types, and set priors deliberately rather than by guesswork.
Target statistics configuration is one of the highest-leverage tuning areas in CatBoost. Small changes to CTR parameters can significantly impact model performance, especially on categorical-heavy datasets. This page provides the foundation for principled tuning rather than trial-and-error.
Let's formalize target statistics rigorously. Consider a dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with a categorical feature $c$ taking values in $\mathcal{C} = \{c_1, \ldots, c_k\}$.
The Core Formulation
For a permutation $\sigma$ of training indices, the ordered target statistic for sample $i$ is:
$$TS^{\sigma}_i = \frac{\sum_{j:\, \sigma(j) < \sigma(i),\ c_j = c_i} w_j \cdot g(y_j) + a \cdot P}{\sum_{j:\, \sigma(j) < \sigma(i),\ c_j = c_i} w_j + a}$$
Where the condition $\sigma(j) < \sigma(i)$ selects the samples preceding $i$ in the permutation, $w_j$ is the sample weight, $g(\cdot)$ is a target transformation (the identity for regression, a class indicator for classification), $a > 0$ is the prior weight, and $P$ is the prior value.
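To make the mechanics concrete, here is a minimal NumPy sketch of the unweighted case ($w_j = 1$, $g$ the identity). The function name and signature are illustrative, not CatBoost API:

```python
import numpy as np

def ordered_target_statistic(cats, y, prior_p=0.5, prior_weight=1.0, seed=0):
    """Ordered target statistic for one random permutation (unweighted case)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))  # the permutation sigma
    sums, counts = {}, {}            # running per-category sum and count
    ts = np.empty(len(y), dtype=float)
    for idx in order:                # visit samples in sigma-order
        c = cats[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # only samples that precede idx in sigma (same category) contribute
        ts[idx] = (s + prior_weight * prior_p) / (n + prior_weight)
        sums[c] = s + y[idx]
        counts[c] = n + 1
    return ts

cats = np.array(['a', 'b', 'a', 'a', 'b'])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistic(cats, y))
```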
Interpretation as Bayesian Posterior
The formula has an elegant Bayesian interpretation. If we model the target probability for category $c$ as coming from a Beta distribution with prior $\text{Beta}(a \cdot P, a \cdot (1-P))$:
$$P(\theta_c | \text{data}) \propto P(\text{data} | \theta_c) \cdot P(\theta_c)$$
Then $TS^\sigma_i$ is the posterior mean after observing the preceding samples:
$$\mathbb{E}[\theta_c | \text{preceding samples}] = TS^\sigma_i$$
This Bayesian view explains why the formula works: with few preceding observations the prior dominates and rare categories are shrunk toward $P$, while with many observations the data dominates and the statistic converges to the category's empirical mean.
Viewing target statistics as Bayesian posteriors provides intuition for parameter tuning. Higher prior weight $a$ means stronger initial beliefs that require more evidence to overcome. The prior $P$ should reflect your base rate expectation for unpopular categories.
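A quick numeric check of this equivalence for a binary target (plain arithmetic, no CatBoost involved):

```python
# Posterior mean of Beta(a*P + s, a*(1-P) + (n - s)) after observing s
# positives among n preceding samples equals the target-statistic formula.
a, P = 5.0, 0.5      # prior weight and prior value
s, n = 3, 4          # 3 positives among 4 preceding same-category samples
alpha, beta = a * P + s, a * (1 - P) + (n - s)
posterior_mean = alpha / (alpha + beta)      # 5.5 / 9 = 0.611...
ts = (s + a * P) / (n + a)                   # identical by construction
assert abs(posterior_mean - ts) < 1e-12
print(posterior_mean)
```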
Variance Analysis
The variance of a target statistic depends on the number of preceding same-category samples $n_c$, the within-category target variance $\sigma^2_c$, and the prior weight $a$.
For $n_c$ preceding samples in category $c$ with variance $\sigma^2_c$:
$$\text{Var}(TS^\sigma_i) \approx \frac{\sigma^2_c}{n_c + a} + O\left(\frac{1}{(n_c + a)^2}\right)$$
Key implications: variance shrinks roughly as $1/(n_c + a)$, so the prior weight $a$ matters most for rare categories and becomes negligible once $n_c \gg a$. The table below illustrates this for a binary target with $\sigma^2_c = 0.25$:
| Category Size (n_c) | Prior Weight a=1 | Prior Weight a=5 | Prior Weight a=10 |
|---|---|---|---|
| 1 | σ²/2 = 0.125 | σ²/6 = 0.042 | σ²/11 = 0.023 |
| 5 | σ²/6 = 0.042 | σ²/10 = 0.025 | σ²/15 = 0.017 |
| 20 | σ²/21 = 0.012 | σ²/25 = 0.010 | σ²/30 = 0.008 |
| 100 | σ²/101 = 0.0025 | σ²/105 = 0.0024 | σ²/110 = 0.0023 |
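To see how tight the leading-order approximation is, one can compare it with the exact Beta posterior variance. This is a hypothetical arithmetic check; the positive count $s$ is set to its expected value $n_c P$ for illustration:

```python
# Compare sigma^2/(n_c + a) against the exact Beta posterior variance for a
# binary target with sigma^2 = P*(1-P) = 0.25; illustrative only.
a, P = 1.0, 0.5
sigma2 = P * (1 - P)
for n_c in [1, 5, 20, 100]:
    s = n_c * P                                  # expected positive count
    alpha, beta = a * P + s, a * (1 - P) + (n_c - s)
    exact = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    approx = sigma2 / (n_c + a)
    print(f"n_c={n_c:3d}  exact={exact:.4f}  approx={approx:.4f}")
```

The approximation overshoots for tiny categories, where the $O(1/(n_c + a)^2)$ term is non-negligible, and converges for large ones, which is exactly where the prior weight stops mattering.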
CatBoost provides multiple target statistic types, each optimized for different scenarios. Understanding when to use each type is crucial for optimal model performance.
1. Borders (Default)
The standard target mean statistic, discretized using histogram borders:
$$TS_{\text{Borders}}(i) = \text{Bin}\left(\frac{\sum_{j \prec i} y_j + a \cdot P}{\sum_{j \prec i} 1 + a}\right)$$
Where $\text{Bin}(\cdot)$ maps the continuous value to one of $B$ histogram bins.
Use cases: the general-purpose default for classification and most tabular problems (see the comparison table below).
Key parameters:
- TargetBorderCount: Number of target binarizations (1 for binary classification)
- CtrBorderCount: Number of CTR discretization bins (default 15)

2. Buckets
Instead of soft CTR values, categories are mapped to integer bucket indices:
$$TS_{\text{Buckets}}(i) = \min\left(B, \left\lfloor B \cdot \frac{\sum_{j \prec i} y_j + a \cdot P}{\sum_{j \prec i} 1 + a} \right\rfloor\right)$$
Use cases: clean, low-noise categories where a hard bucket assignment suffices and memory is at a premium.
3. Counter
Uses raw category counts instead of target means:
$$TS_{\text{Counter}}(i) = \log_2(\text{count}_{c_i} + 1)$$
Where $\text{count}_{c_i}$ is the number of preceding samples with category $c_i$.
Use cases: problems where category frequency itself carries signal, such as popularity effects.
4. BinarizedTargetMeanValue
For regression tasks, binarizes the target at multiple thresholds and computes means:
$$TS_{\text{BTMV}}(i, b) = \frac{\sum_{j \prec i} \mathbf{1}[y_j > \tau_b] + a \cdot P_b}{\sum_{j \prec i} 1 + a}$$
Where $\tau_1, ..., \tau_B$ are target thresholds.
Use cases: regression, especially with skewed or heterogeneous targets where a single mean is uninformative.
5. FeatureFreq
Categorical feature frequency in the dataset:
$$TS_{\text{FeatureFreq}}(i) = \frac{\text{count}_{c_i}}{n}$$
Use cases: encoding category popularity without touching the target at all, which makes it inherently leakage-free.
| Type | Output | Leakage-Free | Best For | Memory |
|---|---|---|---|---|
| Borders | Binned mean | Yes (ordered) | Classification, general | Moderate |
| Buckets | Bucket index | Yes (ordered) | Clean categories | Low |
| Counter | Log count | Yes (ordered) | Frequency signals | Low |
| BinarizedTargetMeanValue | Multiple binned means | Yes (ordered) | Regression | Higher |
| FeatureFreq | Frequency ratio | Yes (no target) | Popularity | Low |
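As a sketch of how these types might be mixed explicitly, the configuration below combines ordered target means with a frequency signal. The CTR description strings follow the format used elsewhere on this page; exact availability of each type can depend on task type and hardware:

```python
from catboost import CatBoostClassifier

# Hedged configuration sketch: give the model both binned ordered target
# means and a pure category-frequency signal.
model = CatBoostClassifier(
    simple_ctr=[
        'Borders:CtrBorderCount=15',  # binned ordered target means
        'Counter',                    # category-frequency signal
    ],
    iterations=300,
    verbose=0,
)
```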
The prior $(P, a)$ represents our default belief about category effects before observing data. Getting this right is crucial for handling rare categories and cold-start scenarios.
Prior Value (P)
The prior value $P$ is the fallback encoding for categories with no preceding observations:
For classification, any value in $[0, 1]$ is valid; Prior=0.5 is a neutral default.

Guidance for setting P:
| Scenario | Recommended P | Reasoning |
|---|---|---|
| Balanced classification | 0.5 | Neutral default |
| Imbalanced (90% positive) | 0.9 | Matches base rate |
| Rare positive events | 0.01-0.1 | Conservative; avoids overconfidence |
| Regression (normalized) | 0.0 or mean | Center of target distribution |
Prior Weight (a)
The prior weight $a$ determines regularization strength, that is, how much data is needed to overcome the prior.
Guidance for setting a:
| Scenario | Recommended a | Reasoning |
|---|---|---|
| High-cardinality (>1000 cats) | 5-20 | Many rare categories need shrinkage |
| Low-cardinality (<100 cats) | 0.5-2 | Categories have enough data |
| Noisy targets | 10-50 | Reduce variance from noise |
| Clean, high-signal data | 0.5-1 | Let data speak |
| Cold-start critical | 10+ | New categories get safe default |
Start with default priors and tune only if validation performance suggests issues. Signs of wrong prior weight: (1) rare categories perform much worse than common ones → increase a, (2) model ignores clear category signals → decrease a. Use category-stratified validation to detect these patterns.
```python
from catboost import CatBoostClassifier, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def experiment_with_priors(X_train, y_train, X_test, y_test, cat_features):
    """
    Experiment with different prior configurations to understand their impact.
    """
    results = []

    # Prior configurations to test
    configs = [
        {'name': 'Default', 'ctr': None},
        {'name': 'Low prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=0.5']},
        {'name': 'Medium prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=5.0']},
        {'name': 'High prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=20.0']},
        {'name': 'Class-adjusted prior',
         'ctr': [f'Borders:Prior={y_train.mean():.3f}:PriorWeight=5.0']},
    ]

    for config in configs:
        model_params = {
            'iterations': 500,
            'learning_rate': 0.05,
            'depth': 6,
            'cat_features': cat_features,
            'random_seed': 42,
            'verbose': 0,
            'early_stopping_rounds': 50,
        }
        if config['ctr'] is not None:
            model_params['simple_ctr'] = config['ctr']

        model = CatBoostClassifier(**model_params)
        train_pool = Pool(X_train, y_train, cat_features=cat_features)
        test_pool = Pool(X_test, y_test, cat_features=cat_features)
        model.fit(train_pool, eval_set=test_pool, use_best_model=True)

        # Evaluate
        probs = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, probs)

        results.append({
            'config': config['name'],
            'auc': auc,
            'best_iteration': model.best_iteration_
        })
        print(f"{config['name']:30s} AUC: {auc:.4f} (iter: {model.best_iteration_})")

    return pd.DataFrame(results)


def analyze_category_performance(model, X_test, y_test, cat_feature_name):
    """
    Analyze model performance stratified by category size.
    Helps diagnose whether prior configuration is appropriate.
    """
    # Get predictions
    probs = model.predict_proba(X_test)[:, 1]

    # Count category frequencies
    cat_values = X_test[cat_feature_name]
    cat_counts = cat_values.value_counts()

    # Stratify by category size
    size_buckets = {
        'Rare (1-5)': (1, 5),
        'Uncommon (6-20)': (6, 20),
        'Common (21-100)': (21, 100),
        'Frequent (100+)': (101, float('inf'))
    }

    print(f"\nPerformance by category size for '{cat_feature_name}':")
    print("-" * 50)

    for bucket_name, (min_count, max_count) in size_buckets.items():
        # Find categories in this size bucket
        cats_in_bucket = cat_counts[
            (cat_counts >= min_count) & (cat_counts <= max_count)
        ].index

        # Filter samples to these categories
        mask = cat_values.isin(cats_in_bucket)
        n_samples = mask.sum()

        if n_samples > 10:  # Need enough samples for AUC
            bucket_auc = roc_auc_score(y_test[mask], probs[mask])
            print(f"  {bucket_name:20s}: AUC={bucket_auc:.4f} (n={n_samples})")
        else:
            print(f"  {bucket_name:20s}: Insufficient samples (n={n_samples})")

    # If rare categories perform much worse, increase prior weight
    # If common categories perform worse, prior weight may be too high
```

One of CatBoost's most powerful capabilities is automatically generating and evaluating combinations of categorical features. This captures interaction effects without manual feature engineering.
The Combination Mechanism
For categorical features $c^{(1)}, c^{(2)}, ..., c^{(k)}$, CatBoost creates combined features:
$$c^{\text{combined}}_i = (c^{(1)}_i, c^{(2)}_i, \ldots, c^{(k)}_i)$$
The combined feature is treated as a new categorical with cardinality up to $|\mathcal{C}_1| \times |\mathcal{C}_2| \times ... \times |\mathcal{C}_k|$.
Target statistics are then computed for this combined category in the same ordered manner:
$$TS^{\text{combined}}_i = \frac{\sum_{j \prec i,\ c^{\text{combined}}_j = c^{\text{combined}}_i} y_j + a \cdot P}{\sum_{j \prec i,\ c^{\text{combined}}_j = c^{\text{combined}}_i} 1 + a}$$
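As a conceptual sketch (not CatBoost internals), a combined feature can be formed by concatenating category values into a single key and encoding it like any single feature, here reusing the hypothetical `ordered_target_statistic` helper sketched earlier:

```python
import numpy as np
import pandas as pd

# Build a combined categorical key and reuse the single-feature encoder on it.
df = pd.DataFrame({
    'country': ['US', 'US', 'JP', 'JP', 'US'],
    'device':  ['mobile', 'desktop', 'desktop', 'desktop', 'mobile'],
    'y':       [1, 0, 1, 1, 1],
})
combined = (df['country'] + '|' + df['device']).to_numpy()  # tuple-as-string key
print(ordered_target_statistic(combined, df['y'].to_numpy()))
```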
Why Combinations Matter
Consider predicting purchase intent with features: Country, Browser, Device
Individual effects: each feature alone carries some signal (say, mobile users convert slightly more often, or one country has a higher base rate).
Interaction effects (which combinations capture): specific value pairs behave differently from what the individual effects predict, e.g., US mobile users converting far more than either "US" or "mobile" alone would suggest.
Without combinations, the model must learn these interactions indirectly through deep tree structures. With combinations, they become direct features.
Combinations increase cardinality multiplicatively. For Country (200) × Browser (5) × Device (3), you get up to 3,000 combined categories. Most will have few samples, making prior weight $a$ even more important. CatBoost handles this automatically but watch for overfitting on rare combinations.
Controlling Combination Complexity
The max_ctr_complexity parameter limits the maximum number of features combined:
| max_ctr_complexity | Combinations Generated | Cardinality Explosion | Training Time |
|---|---|---|---|
| 1 | Individual only | None | Fastest |
| 2 | Pairs | $O(k^2)$ | Moderate |
| 3 | Triples | $O(k^3)$ | Slower |
| 4 (default) | Quadruples | $O(k^4)$ | Significant |
Recommendation: Start with 2 or 3. Only increase if validation shows benefit and training time is acceptable.
Selective Combination Configuration
Not all feature pairs benefit equally from combination. Use allowed_ctr_feature_combinations to specify which pairs to combine:
```python
# Only combine feature pairs that make domain sense
allowed_combos = [
    (0, 1),  # Combine features at index 0 and 1
    (1, 2),  # Combine features at index 1 and 2
    # Skip (0, 2) if that combination isn't meaningful
]
```
```python
from catboost import CatBoostClassifier, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def analyze_feature_combinations(model, pool):
    """
    Analyze which feature combinations CatBoost found most useful.
    """
    # Get feature importances
    importances = model.get_feature_importance(pool, type='PredictionValuesChange')
    feature_names = model.feature_names_

    # Categorize by combination complexity
    single_features = []
    pair_combinations = []
    higher_order = []

    for name, imp in zip(feature_names, importances):
        if '{' not in name:
            single_features.append((name, imp))
        else:
            # Count features in combination by counting commas
            n_features = name.count(',') + 1
            if n_features == 2:
                pair_combinations.append((name, imp))
            else:
                higher_order.append((name, imp))

    print("Feature Importance by Combination Level")
    print("=" * 60)

    # Aggregate by level
    single_total = sum(imp for _, imp in single_features)
    pair_total = sum(imp for _, imp in pair_combinations)
    higher_total = sum(imp for _, imp in higher_order)
    total = single_total + pair_total + higher_total

    print(f"\nImportance Distribution:")
    print(f"  Single features:   {100*single_total/total:5.1f}%")
    print(f"  Pair combinations: {100*pair_total/total:5.1f}%")
    print(f"  Higher-order:      {100*higher_total/total:5.1f}%")

    print(f"\nTop Single Features:")
    for name, imp in sorted(single_features, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    print(f"\nTop Pair Combinations:")
    for name, imp in sorted(pair_combinations, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    return {
        'single': single_features,
        'pairs': pair_combinations,
        'higher': higher_order
    }


def compare_combination_depths():
    """
    Compare model performance with different max_ctr_complexity settings.
    """
    np.random.seed(42)
    n_samples = 10000

    # Create data with genuine interaction effects
    country = np.random.choice(['US', 'UK', 'DE', 'JP', 'FR'], n_samples)
    device = np.random.choice(['mobile', 'desktop', 'tablet'], n_samples)
    browser = np.random.choice(['chrome', 'safari', 'firefox', 'edge'], n_samples)

    # Target depends on interactions:
    # US + mobile has high conversion; JP + desktop has high conversion
    base_prob = 0.1
    interaction_bonus = np.where(
        ((country == 'US') & (device == 'mobile')) |
        ((country == 'JP') & (device == 'desktop')),
        0.4, 0.0
    )
    probs = base_prob + interaction_bonus
    y = (np.random.rand(n_samples) < probs).astype(int)

    df = pd.DataFrame({
        'country': country,
        'device': device,
        'browser': browser,
        'target': y
    })

    # Split
    train_df = df.iloc[:8000]
    test_df = df.iloc[8000:]
    cat_features = ['country', 'device', 'browser']

    results = []
    for complexity in [1, 2, 3, 4]:
        model = CatBoostClassifier(
            iterations=300,
            learning_rate=0.1,
            depth=4,
            cat_features=cat_features,
            max_ctr_complexity=complexity,
            random_seed=42,
            verbose=0,
        )

        train_pool = Pool(
            train_df.drop('target', axis=1), train_df['target'],
            cat_features=cat_features
        )
        test_pool = Pool(
            test_df.drop('target', axis=1), test_df['target'],
            cat_features=cat_features
        )

        model.fit(train_pool)

        probs = model.predict_proba(test_df.drop('target', axis=1))[:, 1]
        auc = roc_auc_score(test_df['target'], probs)

        results.append({
            'max_ctr_complexity': complexity,
            'auc': auc,
            'n_features': len(model.feature_names_)
        })
        print(f"Complexity {complexity}: AUC={auc:.4f}, Features={len(model.feature_names_)}")

    return pd.DataFrame(results)


# Run comparison
print("\nComparison of max_ctr_complexity settings:")
print("-" * 50)
compare_combination_depths()
```

While target statistics were developed for classification (estimating probabilities), they extend naturally to regression with some modifications.
Direct Mean Encoding for Regression
The simplest approach: compute ordered means of continuous targets:
$$TS^{\text{reg}}_i = \frac{\sum_{j \prec i,\ c_j = c_i} y_j + a \cdot \mu}{\sum_{j \prec i,\ c_j = c_i} 1 + a}$$
Where $\mu$ is the global target mean (prior). This works well when the target is roughly symmetric and a category's mean captures most of its effect; for skewed targets, see the quantile approach below.
Quantile-Based Target Statistics
For skewed or heavy-tailed targets, binarizing at multiple quantile thresholds captures more information than a single mean. Each threshold $\tau_b$ yields its own ordered statistic estimating $P(y > \tau_b \mid \text{category})$, so the encoding describes how a category shifts the whole target distribution rather than just its mean, and it is robust to outliers. A small sketch of the binarization step follows.
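A minimal sketch of the binarization step. Uniform-quantile threshold placement here is an assumption for illustration; CatBoost places borders according to its own binarization settings:

```python
import numpy as np

# Binarize a right-skewed target at B quantile thresholds; each threshold
# tau_b then gets its own ordered statistic estimating P(y > tau_b | category).
y = np.random.default_rng(0).exponential(30.0, size=1000)
B = 3
taus = np.quantile(y, [b / (B + 1) for b in range(1, B + 1)])  # 25/50/75th pct
indicators = (y[:, None] > taus[None, :]).astype(int)          # shape (n, B)
print(taus.round(1), indicators.mean(axis=0))                  # ~0.75, 0.5, 0.25
```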
Use TargetBorderCount > 1 when: (1) targets have outliers or long tails, (2) you suspect category effects differ across the target distribution (e.g., category affects P(y > median) differently than P(y > 90th percentile)), (3) you're solving heterogeneous regression problems.
Configuration for Regression
```python
model = CatBoostRegressor(
    simple_ctr=[
        # Mean encoding with shrinkage
        'Borders:TargetBorderCount=1:PriorWeight=5.0',
        # Quantile binarization for robustness
        'Borders:TargetBorderCount=3:PriorWeight=2.0',
    ],
    combinations_ctr=[
        # Fewer quantiles and a stronger prior for combinations
        # (combined categories have less data each)
        'Borders:TargetBorderCount=2:PriorWeight=10.0',
    ],
)
```
Target Normalization Considerations
For target statistics to work well in regression, the target scale should be sensible: normalize or center targets so the prior $\mu$ is meaningful, and consider transforming or clipping extreme outliers, since ordered means are sensitive to them.
```python
from catboost import CatBoostRegressor, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error


def configure_regression_ctr(target_distribution='normal'):
    """
    Configure CTR settings appropriate for different target distributions.
    """
    if target_distribution == 'normal':
        # Standard normally distributed targets
        return CatBoostRegressor(
            simple_ctr=[
                'Borders:TargetBorderCount=1:PriorWeight=3.0',
            ],
        )
    elif target_distribution == 'lognormal':
        # Right-skewed targets (e.g., prices, durations)
        return CatBoostRegressor(
            simple_ctr=[
                # Mean encoding
                'Borders:TargetBorderCount=1:PriorWeight=5.0',
                # Quantile encoding for tail behavior
                'Borders:TargetBorderCount=4:PriorWeight=3.0',
            ],
        )
    elif target_distribution == 'heavy_tailed':
        # Targets with outliers (e.g., claims amounts)
        return CatBoostRegressor(
            simple_ctr=[
                # Multiple quantiles for robust estimation
                'Borders:TargetBorderCount=5:PriorWeight=5.0',
                # Counter for frequency signal
                'Counter',
            ],
            loss_function='MAE',  # Also consider robust loss
        )
    elif target_distribution == 'zero_inflated':
        # Many zeros (e.g., purchase amounts, rare events)
        return CatBoostRegressor(
            simple_ctr=[
                # Low threshold captures P(y > 0)
                'Borders:TargetBorderCount=1:PriorWeight=2.0',
                # Higher thresholds for conditional distribution
                'Borders:TargetBorderCount=3:PriorWeight=5.0',
            ],
        )


def demonstrate_quantile_ctr():
    """
    Show how TargetBorderCount affects what CatBoost learns.
    """
    np.random.seed(42)
    n_samples = 5000

    # Create data with heterogeneous category effects
    categories = np.random.choice(['A', 'B', 'C', 'D'], n_samples)
    targets = np.zeros(n_samples)
    for i, cat in enumerate(categories):
        if cat == 'A':
            targets[i] = np.random.normal(100, 10)   # High mean, low variance
        elif cat == 'B':
            targets[i] = np.random.normal(50, 30)    # Low mean, high variance
        elif cat == 'C':
            targets[i] = np.random.exponential(30)   # Skewed
        else:
            targets[i] = np.random.normal(70, 20)    # Medium

    df = pd.DataFrame({'category': categories, 'target': targets})

    # Compare different TargetBorderCount settings
    results = {}
    for n_borders in [1, 3, 5]:
        model = CatBoostRegressor(
            iterations=200,
            learning_rate=0.1,
            depth=4,
            cat_features=['category'],
            simple_ctr=[f'Borders:TargetBorderCount={n_borders}:PriorWeight=3.0'],
            random_seed=42,
            verbose=0,
        )

        # Simple train/test split
        train_df = df.iloc[:4000]
        test_df = df.iloc[4000:]

        train_pool = Pool(
            train_df[['category']], train_df['target'],
            cat_features=['category']
        )
        test_pool = Pool(
            test_df[['category']], test_df['target'],
            cat_features=['category']
        )

        model.fit(train_pool)
        preds = model.predict(test_df[['category']])

        rmse = np.sqrt(mean_squared_error(test_df['target'], preds))
        mae = mean_absolute_error(test_df['target'], preds)
        results[n_borders] = {'rmse': rmse, 'mae': mae}
        print(f"TargetBorderCount={n_borders}: RMSE={rmse:.2f}, MAE={mae:.2f}")

    return results


print("Demonstrating quantile-based CTR for heterogeneous data:")
print("-" * 60)
demonstrate_quantile_ctr()
```

The choice and number of permutations affects both the statistical properties and computational cost of target statistics.
Why Multiple Permutations?
A single permutation introduces position-dependent behavior: samples early in the ordering have few preceding observations, so their encodings are noisy and prior-dominated, while samples late in the ordering get stable estimates.
Averaging statistics over multiple permutations smooths out this positional variance, as the sketch below illustrates.
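A minimal illustration, reusing the hypothetical `ordered_target_statistic` helper and the `cats`/`y` arrays from the earlier sketch:

```python
import numpy as np

# Average the ordered statistic over several random permutations; each
# sample's encoding becomes less dependent on any single ordering.
ts_runs = np.stack([
    ordered_target_statistic(cats, y, seed=s) for s in range(4)  # 4 permutations
])
ts_avg = ts_runs.mean(axis=0)
print(ts_runs.std(axis=0))  # per-sample spread across permutations
```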
CatBoost's Permutation Strategy
CatBoost maintains several random permutations during training and computes permutation-specific target statistics, which reduces the final model's dependence on any single ordering.
Default: 4 permutations. Configurable via permutation_count parameter.
| Permutations | Variance | Training Time | Memory | Recommended Use |
|---|---|---|---|---|
| 1 | High (position-dependent) | Fastest | Lowest | Quick experiments only |
| 4 (default) | Moderate | Moderate | Moderate | Production default |
| 8 | Low | ~2x baseline | ~2x baseline | High-stakes, small data |
| 16 | Very low | ~4x baseline | ~4x baseline | Research, maximum quality |
Seed and Reproducibility
For reproducible target statistics:
```python
model = CatBoostClassifier(
    random_seed=42,  # Seeds all randomness including permutations
)
```
With the same seed, the permutations, and therefore the target statistics and the trained model, are reproducible across runs (given identical data and settings).
Time-Aware Ordering
For time-series or temporal data, consider ordered boosting modes:
```python
model = CatBoostClassifier(
    has_time=True,            # Treat the given row order (or the Pool's
                              # Timestamp) as chronological order
    boosting_type='Ordered',
)
```
With has_time=True, samples are ordered chronologically for target statistics, preventing future data from leaking into past predictions.
For time-series data, random permutations allow future targets to influence past encodings. Set has_time=True to enforce chronological ordering. This is critical for forecasting applications where temporal integrity must be preserved.
Target statistics are the mathematical engine that powers CatBoost's categorical feature handling. Understanding their mechanics enables practitioners to tune models optimally for their specific data characteristics.
Next, we'll explore CatBoost's symmetric (oblivious) decision trees—the structural innovation that makes ordered boosting computationally efficient and enables extremely fast inference.