Target statistics (often called CTR from "Click-Through Rate" in CatBoost's advertising origins) are the mathematical mechanisms that transform categorical features into numerical representations suitable for gradient boosting. While we introduced the basic concept in the previous section, this page dives deep into the mathematical foundations, variants, and optimization strategies.
Understanding target statistics at this depth lets practitioners reason about rare-category behavior, choose among the available CTR types, and set priors deliberately rather than by guesswork.
Target statistics configuration is one of the highest-leverage tuning areas in CatBoost. Small changes to CTR parameters can significantly impact model performance, especially on categorical-heavy datasets. This page provides the foundation for principled tuning rather than trial-and-error.
Let's formalize target statistics rigorously. Consider a dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with a categorical feature $c$ taking values in $\mathcal{C} = \{c_1, \ldots, c_k\}$.
The Core Formulation
For a permutation $\sigma$ of training indices, the ordered target statistic for sample $i$ is:
$$TS^{\sigma}_i = \frac{\sum_{j:\, \sigma(j) < \sigma(i),\ c_j = c_i} w_j \cdot g(y_j) + a \cdot P}{\sum_{j:\, \sigma(j) < \sigma(i),\ c_j = c_i} w_j + a}$$
Where the condition $\sigma(j) < \sigma(i)$ selects the samples preceding $i$ in the permutation, $w_j$ is the sample weight, $g(\cdot)$ is a target transformation (the identity for regression, a class indicator for classification), $a > 0$ is the prior weight, and $P$ is the prior value.
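To make the mechanics concrete, here is a minimal NumPy sketch of the unweighted case ($w_j = 1$, $g$ the identity). The function name and signature are illustrative, not CatBoost API:

```python
import numpy as np

def ordered_target_statistic(cats, y, prior_p=0.5, prior_weight=1.0, seed=0):
    """Ordered target statistic for one random permutation (unweighted case)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))  # the permutation sigma
    sums, counts = {}, {}            # running per-category sum and count
    ts = np.empty(len(y), dtype=float)
    for idx in order:                # visit samples in sigma-order
        c = cats[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # only samples that precede idx in sigma (same category) contribute
        ts[idx] = (s + prior_weight * prior_p) / (n + prior_weight)
        sums[c] = s + y[idx]
        counts[c] = n + 1
    return ts

cats = np.array(['a', 'b', 'a', 'a', 'b'])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistic(cats, y))
```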
Interpretation as Bayesian Posterior
The formula has an elegant Bayesian interpretation. If we model the target probability for category $c$ as coming from a Beta distribution with prior $\text{Beta}(a \cdot P, a \cdot (1-P))$:
$$P(\theta_c | \text{data}) \propto P(\text{data} | \theta_c) \cdot P(\theta_c)$$
Then $TS^\sigma_i$ is the posterior mean after observing the preceding samples:
$$\mathbb{E}[\theta_c | \text{preceding samples}] = TS^\sigma_i$$
This Bayesian view explains why the formula works: with few preceding observations the prior dominates and rare categories are shrunk toward $P$, while with many observations the data dominates and the statistic converges to the category's empirical mean.
Viewing target statistics as Bayesian posteriors provides intuition for parameter tuning. Higher prior weight $a$ means stronger initial beliefs that require more evidence to overcome. The prior $P$ should reflect your base rate expectation for unpopular categories.
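A quick numeric check of this equivalence for a binary target (plain arithmetic, no CatBoost involved):

```python
# Posterior mean of Beta(a*P + s, a*(1-P) + (n - s)) after observing s
# positives among n preceding samples equals the target-statistic formula.
a, P = 5.0, 0.5      # prior weight and prior value
s, n = 3, 4          # 3 positives among 4 preceding same-category samples
alpha, beta = a * P + s, a * (1 - P) + (n - s)
posterior_mean = alpha / (alpha + beta)      # 5.5 / 9 = 0.611...
ts = (s + a * P) / (n + a)                   # identical by construction
assert abs(posterior_mean - ts) < 1e-12
print(posterior_mean)
```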
Variance Analysis
The variance of a target statistic depends on the number of preceding same-category samples $n_c$, the within-category target variance $\sigma^2_c$, and the prior weight $a$.
For $n_c$ preceding samples in category $c$ with variance $\sigma^2_c$:
$$\text{Var}(TS^\sigma_i) \approx \frac{\sigma^2_c}{n_c + a} + O\left(\frac{1}{(n_c + a)^2}\right)$$
Key implications: variance shrinks roughly as $1/(n_c + a)$, so the prior weight $a$ matters most for rare categories and becomes negligible once $n_c \gg a$. The table below illustrates this for a binary target with $\sigma^2_c = 0.25$:
| Category Size (n_c) | Prior Weight a=1 | Prior Weight a=5 | Prior Weight a=10 |
|---|---|---|---|
| 1 | σ²/2 = 0.125 | σ²/6 = 0.042 | σ²/11 = 0.023 |
| 5 | σ²/6 = 0.042 | σ²/10 = 0.025 | σ²/15 = 0.017 |
| 20 | σ²/21 = 0.012 | σ²/25 = 0.010 | σ²/30 = 0.008 |
| 100 | σ²/101 = 0.0025 | σ²/105 = 0.0024 | σ²/110 = 0.0023 |
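To see how tight the leading-order approximation is, one can compare it with the exact Beta posterior variance. This is a hypothetical arithmetic check; the positive count $s$ is set to its expected value $n_c P$ for illustration:

```python
# Compare sigma^2/(n_c + a) against the exact Beta posterior variance for a
# binary target with sigma^2 = P*(1-P) = 0.25; illustrative only.
a, P = 1.0, 0.5
sigma2 = P * (1 - P)
for n_c in [1, 5, 20, 100]:
    s = n_c * P                                  # expected positive count
    alpha, beta = a * P + s, a * (1 - P) + (n_c - s)
    exact = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    approx = sigma2 / (n_c + a)
    print(f"n_c={n_c:3d}  exact={exact:.4f}  approx={approx:.4f}")
```

The approximation overshoots for tiny categories, where the $O(1/(n_c + a)^2)$ term is non-negligible, and converges for large ones, which is exactly where the prior weight stops mattering.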
CatBoost provides multiple target statistic types, each optimized for different scenarios. Understanding when to use each type is crucial for optimal model performance.
1. Borders (Default)
The standard target mean statistic, discretized using histogram borders:
$$TS_{\text{Borders}}(i) = \text{Bin}\left(\frac{\sum_{j \prec i} y_j + a \cdot P}{\sum_{j \prec i} 1 + a}\right)$$
Where $\text{Bin}(\cdot)$ maps the continuous value to one of $B$ histogram bins.
Use cases: the general-purpose default for classification and most tabular problems (see the comparison table below).
Key parameters:
- TargetBorderCount: Number of target binarizations (1 for binary classification)
- CtrBorderCount: Number of CTR discretization bins (default 15)

2. Buckets
Instead of soft CTR values, categories are mapped to integer bucket indices:
$$TS_{\text{Buckets}}(i) = \min\left(B, \left\lfloor B \cdot \frac{\sum_{j \prec i} y_j + a \cdot P}{\sum_{j \prec i} 1 + a} \right\rfloor\right)$$
Use cases: clean, low-noise categories where a hard bucket assignment suffices and memory is at a premium.
3. Counter
Uses raw category counts instead of target means:
$$TS_{\text{Counter}}(i) = \log_2(\text{count}_{c_i} + 1)$$
Where $\text{count}_{c_i}$ is the number of preceding samples with category $c_i$.
Use cases: problems where category frequency itself carries signal, such as popularity effects.
4. BinarizedTargetMeanValue
For regression tasks, binarizes the target at multiple thresholds and computes means:
$$TS_{\text{BTMV}}(i, b) = \frac{\sum_{j \prec i} \mathbf{1}[y_j > \tau_b] + a \cdot P_b}{\sum_{j \prec i} 1 + a}$$
Where $\tau_1, ..., \tau_B$ are target thresholds.
Use cases: regression, especially with skewed or heterogeneous targets where a single mean is uninformative.
5. FeatureFreq
Categorical feature frequency in the dataset:
$$TS_{\text{FeatureFreq}}(i) = \frac{\text{count}_{c_i}}{n}$$
Use cases: encoding category popularity without touching the target at all, which makes it inherently leakage-free.
| Type | Output | Leakage-Free | Best For | Memory |
|---|---|---|---|---|
| Borders | Binned mean | Yes (ordered) | Classification, general | Moderate |
| Buckets | Bucket index | Yes (ordered) | Clean categories | Low |
| Counter | Log count | Yes (ordered) | Frequency signals | Low |
| BinarizedTargetMeanValue | Multiple binned means | Yes (ordered) | Regression | Higher |
| FeatureFreq | Frequency ratio | Yes (no target) | Popularity | Low |
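As a sketch of how these types might be mixed explicitly, the configuration below combines ordered target means with a frequency signal. The CTR description strings follow the format used elsewhere on this page; exact availability of each type can depend on task type and hardware:

```python
from catboost import CatBoostClassifier

# Hedged configuration sketch: give the model both binned ordered target
# means and a pure category-frequency signal.
model = CatBoostClassifier(
    simple_ctr=[
        'Borders:CtrBorderCount=15',  # binned ordered target means
        'Counter',                    # category-frequency signal
    ],
    iterations=300,
    verbose=0,
)
```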
The prior $(P, a)$ represents our default belief about category effects before observing data. Getting this right is crucial for handling rare categories and cold-start scenarios.
Prior Value (P)
The prior value $P$ is the fallback encoding for categories with no preceding observations:
For classification, any value in $[0, 1]$ is valid; Prior=0.5 is a neutral default.

Guidance for setting P:
| Scenario | Recommended P | Reasoning |
|---|---|---|
| Balanced classification | 0.5 | Neutral default |
| Imbalanced (90% positive) | 0.9 | Matches base rate |
| Rare positive events | 0.01-0.1 | Conservative; avoids overconfidence |
| Regression (normalized) | 0.0 or mean | Center of target distribution |
Prior Weight (a)
The prior weight $a$ determines regularization strength, that is, how much data is needed to overcome the prior.
Guidance for setting a:
| Scenario | Recommended a | Reasoning |
|---|---|---|
| High-cardinality (>1000 cats) | 5-20 | Many rare categories need shrinkage |
| Low-cardinality (<100 cats) | 0.5-2 | Categories have enough data |
| Noisy targets | 10-50 | Reduce variance from noise |
| Clean, high-signal data | 0.5-1 | Let data speak |
| Cold-start critical | 10+ | New categories get safe default |
Start with default priors and tune only if validation performance suggests issues. Signs of wrong prior weight: (1) rare categories perform much worse than common ones → increase a, (2) model ignores clear category signals → decrease a. Use category-stratified validation to detect these patterns.
```python
from catboost import CatBoostClassifier, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def experiment_with_priors(X_train, y_train, X_test, y_test, cat_features):
    """
    Experiment with different prior configurations to understand their impact.
    """
    results = []

    # Prior configurations to test
    configs = [
        {'name': 'Default', 'ctr': None},
        {'name': 'Low prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=0.5']},
        {'name': 'Medium prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=5.0']},
        {'name': 'High prior weight', 'ctr': ['Borders:Prior=0.5:PriorWeight=20.0']},
        {'name': 'Class-adjusted prior',
         'ctr': [f'Borders:Prior={y_train.mean():.3f}:PriorWeight=5.0']},
    ]

    for config in configs:
        model_params = {
            'iterations': 500,
            'learning_rate': 0.05,
            'depth': 6,
            'cat_features': cat_features,
            'random_seed': 42,
            'verbose': 0,
            'early_stopping_rounds': 50,
        }
        if config['ctr'] is not None:
            model_params['simple_ctr'] = config['ctr']

        model = CatBoostClassifier(**model_params)
        train_pool = Pool(X_train, y_train, cat_features=cat_features)
        test_pool = Pool(X_test, y_test, cat_features=cat_features)
        model.fit(train_pool, eval_set=test_pool, use_best_model=True)

        # Evaluate
        probs = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, probs)

        results.append({
            'config': config['name'],
            'auc': auc,
            'best_iteration': model.best_iteration_
        })
        print(f"{config['name']:30s} AUC: {auc:.4f} (iter: {model.best_iteration_})")

    return pd.DataFrame(results)


def analyze_category_performance(model, X_test, y_test, cat_feature_name):
    """
    Analyze model performance stratified by category size.
    Helps diagnose whether prior configuration is appropriate.
    """
    # Get predictions
    probs = model.predict_proba(X_test)[:, 1]

    # Count category frequencies
    cat_values = X_test[cat_feature_name]
    cat_counts = cat_values.value_counts()

    # Stratify by category size
    size_buckets = {
        'Rare (1-5)': (1, 5),
        'Uncommon (6-20)': (6, 20),
        'Common (21-100)': (21, 100),
        'Frequent (100+)': (101, float('inf'))
    }

    print(f"\nPerformance by category size for '{cat_feature_name}':")
    print("-" * 50)

    for bucket_name, (min_count, max_count) in size_buckets.items():
        # Find categories in this size bucket
        cats_in_bucket = cat_counts[
            (cat_counts >= min_count) & (cat_counts <= max_count)
        ].index

        # Filter samples to these categories
        mask = cat_values.isin(cats_in_bucket)
        n_samples = mask.sum()

        if n_samples > 10:  # Need enough samples for AUC
            bucket_auc = roc_auc_score(y_test[mask], probs[mask])
            print(f"  {bucket_name:20s}: AUC={bucket_auc:.4f} (n={n_samples})")
        else:
            print(f"  {bucket_name:20s}: Insufficient samples (n={n_samples})")

    # If rare categories perform much worse, increase prior weight
    # If common categories perform worse, prior weight may be too high
```

One of CatBoost's most powerful capabilities is automatically generating and evaluating combinations of categorical features. This captures interaction effects without manual feature engineering.
The Combination Mechanism
For categorical features $c^{(1)}, c^{(2)}, ..., c^{(k)}$, CatBoost creates combined features:
$$c^{\text{combined}}_i = (c^{(1)}_i, c^{(2)}_i, \ldots, c^{(k)}_i)$$
The combined feature is treated as a new categorical with cardinality up to $|\mathcal{C}_1| \times |\mathcal{C}_2| \times ... \times |\mathcal{C}_k|$.
Target statistics are then computed for this combined category in the same ordered manner:
$$TS^{\text{combined}}_i = \frac{\sum_{j \prec i,\ c^{\text{combined}}_j = c^{\text{combined}}_i} y_j + a \cdot P}{\sum_{j \prec i,\ c^{\text{combined}}_j = c^{\text{combined}}_i} 1 + a}$$
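As a conceptual sketch (not CatBoost internals), a combined feature can be formed by concatenating category values into a single key and encoding it like any single feature, here reusing the hypothetical `ordered_target_statistic` helper sketched earlier:

```python
import numpy as np
import pandas as pd

# Build a combined categorical key and reuse the single-feature encoder on it.
df = pd.DataFrame({
    'country': ['US', 'US', 'JP', 'JP', 'US'],
    'device':  ['mobile', 'desktop', 'desktop', 'desktop', 'mobile'],
    'y':       [1, 0, 1, 1, 1],
})
combined = (df['country'] + '|' + df['device']).to_numpy()  # tuple-as-string key
print(ordered_target_statistic(combined, df['y'].to_numpy()))
```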
Why Combinations Matter
Consider predicting purchase intent with features: Country, Browser, Device
Individual effects: each feature alone carries some signal (say, mobile users convert slightly more often, or one country has a higher base rate).
Interaction effects (which combinations capture): specific value pairs behave differently from what the individual effects predict, e.g., US mobile users converting far more than either "US" or "mobile" alone would suggest.
Without combinations, the model must learn these interactions indirectly through deep tree structures. With combinations, they become direct features.
Combinations increase cardinality multiplicatively. For Country (200) × Browser (5) × Device (3), you get up to 3,000 combined categories. Most will have few samples, making prior weight $a$ even more important. CatBoost handles this automatically but watch for overfitting on rare combinations.
Controlling Combination Complexity
The max_ctr_complexity parameter limits the maximum number of features combined:
| max_ctr_complexity | Combinations Generated | Cardinality Explosion | Training Time |
|---|---|---|---|
| 1 | Individual only | None | Fastest |
| 2 | Pairs | $O(k^2)$ | Moderate |
| 3 | Triples | $O(k^3)$ | Slower |
| 4 (default) | Quadruples | $O(k^4)$ | Significant |
Recommendation: Start with 2 or 3. Only increase if validation shows benefit and training time is acceptable.
Selective Combination Configuration
Not all feature pairs benefit equally from combination. Use allowed_ctr_feature_combinations to specify which pairs to combine:
```python
# Only combine feature pairs that make domain sense
allowed_combos = [
    (0, 1),  # Combine features at index 0 and 1
    (1, 2),  # Combine features at index 1 and 2
    # Skip (0, 2) if that combination isn't meaningful
]
```
```python
from catboost import CatBoostClassifier, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def analyze_feature_combinations(model, pool):
    """
    Analyze which feature combinations CatBoost found most useful.
    """
    # Get feature importances
    importances = model.get_feature_importance(pool, type='PredictionValuesChange')
    feature_names = model.feature_names_

    # Categorize by combination complexity
    single_features = []
    pair_combinations = []
    higher_order = []

    for name, imp in zip(feature_names, importances):
        if '{' not in name:
            single_features.append((name, imp))
        else:
            # Count features in combination by counting commas
            n_features = name.count(',') + 1
            if n_features == 2:
                pair_combinations.append((name, imp))
            else:
                higher_order.append((name, imp))

    print("Feature Importance by Combination Level")
    print("=" * 60)

    # Aggregate by level
    single_total = sum(imp for _, imp in single_features)
    pair_total = sum(imp for _, imp in pair_combinations)
    higher_total = sum(imp for _, imp in higher_order)
    total = single_total + pair_total + higher_total

    print(f"\nImportance Distribution:")
    print(f"  Single features:   {100*single_total/total:5.1f}%")
    print(f"  Pair combinations: {100*pair_total/total:5.1f}%")
    print(f"  Higher-order:      {100*higher_total/total:5.1f}%")

    print(f"\nTop Single Features:")
    for name, imp in sorted(single_features, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    print(f"\nTop Pair Combinations:")
    for name, imp in sorted(pair_combinations, key=lambda x: -x[1])[:5]:
        print(f"  {name}: {imp:.3f}")

    return {
        'single': single_features,
        'pairs': pair_combinations,
        'higher': higher_order
    }


def compare_combination_depths():
    """
    Compare model performance with different max_ctr_complexity settings.
    """
    np.random.seed(42)
    n_samples = 10000

    # Create data with genuine interaction effects
    country = np.random.choice(['US', 'UK', 'DE', 'JP', 'FR'], n_samples)
    device = np.random.choice(['mobile', 'desktop', 'tablet'], n_samples)
    browser = np.random.choice(['chrome', 'safari', 'firefox', 'edge'], n_samples)

    # Target depends on interactions:
    # US + mobile has high conversion; JP + desktop has high conversion
    base_prob = 0.1
    interaction_bonus = np.where(
        ((country == 'US') & (device == 'mobile')) |
        ((country == 'JP') & (device == 'desktop')),
        0.4, 0.0
    )
    probs = base_prob + interaction_bonus
    y = (np.random.rand(n_samples) < probs).astype(int)

    df = pd.DataFrame({
        'country': country,
        'device': device,
        'browser': browser,
        'target': y
    })

    # Split
    train_df = df.iloc[:8000]
    test_df = df.iloc[8000:]
    cat_features = ['country', 'device', 'browser']

    results = []
    for complexity in [1, 2, 3, 4]:
        model = CatBoostClassifier(
            iterations=300,
            learning_rate=0.1,
            depth=4,
            cat_features=cat_features,
            max_ctr_complexity=complexity,
            random_seed=42,
            verbose=0,
        )

        train_pool = Pool(
            train_df.drop('target', axis=1), train_df['target'],
            cat_features=cat_features
        )
        test_pool = Pool(
            test_df.drop('target', axis=1), test_df['target'],
            cat_features=cat_features
        )

        model.fit(train_pool)

        probs = model.predict_proba(test_df.drop('target', axis=1))[:, 1]
        auc = roc_auc_score(test_df['target'], probs)

        results.append({
            'max_ctr_complexity': complexity,
            'auc': auc,
            'n_features': len(model.feature_names_)
        })
        print(f"Complexity {complexity}: AUC={auc:.4f}, Features={len(model.feature_names_)}")

    return pd.DataFrame(results)


# Run comparison
print("\nComparison of max_ctr_complexity settings:")
print("-" * 50)
compare_combination_depths()
```

While target statistics were developed for classification (estimating probabilities), they extend naturally to regression with some modifications.
Direct Mean Encoding for Regression
The simplest approach: compute ordered means of continuous targets:
$$TS^{\text{reg}}_i = \frac{\sum_{j \prec i,\ c_j = c_i} y_j + a \cdot \mu}{\sum_{j \prec i,\ c_j = c_i} 1 + a}$$
Where $\mu$ is the global target mean (prior). This works well when the target is roughly symmetric and a category's mean captures most of its effect; for skewed targets, see the quantile approach below.
Quantile-Based Target Statistics
For skewed or heavy-tailed targets, binarizing at multiple quantile thresholds captures more information than a single mean. Each threshold $\tau_b$ yields its own ordered statistic estimating $P(y > \tau_b \mid \text{category})$, so the encoding describes how a category shifts the whole target distribution rather than just its mean, and it is robust to outliers. A small sketch of the binarization step follows.
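A minimal sketch of the binarization step. Uniform-quantile threshold placement here is an assumption for illustration; CatBoost places borders according to its own binarization settings:

```python
import numpy as np

# Binarize a right-skewed target at B quantile thresholds; each threshold
# tau_b then gets its own ordered statistic estimating P(y > tau_b | category).
y = np.random.default_rng(0).exponential(30.0, size=1000)
B = 3
taus = np.quantile(y, [b / (B + 1) for b in range(1, B + 1)])  # 25/50/75th pct
indicators = (y[:, None] > taus[None, :]).astype(int)          # shape (n, B)
print(taus.round(1), indicators.mean(axis=0))                  # ~0.75, 0.5, 0.25
```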
Use TargetBorderCount > 1 when: (1) targets have outliers or long tails, (2) you suspect category effects differ across the target distribution (e.g., category affects P(y > median) differently than P(y > 90th percentile)), (3) you're solving heterogeneous regression problems.
Configuration for Regression
```python
model = CatBoostRegressor(
    simple_ctr=[
        # Mean encoding with shrinkage
        'Borders:TargetBorderCount=1:PriorWeight=5.0',
        # Quantile binarization for robustness
        'Borders:TargetBorderCount=3:PriorWeight=2.0',
    ],
    combinations_ctr=[
        # Fewer quantiles and a stronger prior for combinations
        # (combined categories have less data each)
        'Borders:TargetBorderCount=2:PriorWeight=10.0',
    ],
)
```
Target Normalization Considerations
For target statistics to work well in regression, the target scale should be sensible: normalize or center targets so the prior $\mu$ is meaningful, and consider transforming or clipping extreme outliers, since ordered means are sensitive to them.
```python
from catboost import CatBoostRegressor, Pool
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error


def configure_regression_ctr(target_distribution='normal'):
    """
    Configure CTR settings appropriate for different target distributions.
    """
    if target_distribution == 'normal':
        # Standard normally distributed targets
        return CatBoostRegressor(
            simple_ctr=[
                'Borders:TargetBorderCount=1:PriorWeight=3.0',
            ],
        )
    elif target_distribution == 'lognormal':
        # Right-skewed targets (e.g., prices, durations)
        return CatBoostRegressor(
            simple_ctr=[
                # Mean encoding
                'Borders:TargetBorderCount=1:PriorWeight=5.0',
                # Quantile encoding for tail behavior
                'Borders:TargetBorderCount=4:PriorWeight=3.0',
            ],
        )
    elif target_distribution == 'heavy_tailed':
        # Targets with outliers (e.g., claims amounts)
        return CatBoostRegressor(
            simple_ctr=[
                # Multiple quantiles for robust estimation
                'Borders:TargetBorderCount=5:PriorWeight=5.0',
                # Counter for frequency signal
                'Counter',
            ],
            loss_function='MAE',  # Also consider robust loss
        )
    elif target_distribution == 'zero_inflated':
        # Many zeros (e.g., purchase amounts, rare events)
        return CatBoostRegressor(
            simple_ctr=[
                # Low threshold captures P(y > 0)
                'Borders:TargetBorderCount=1:PriorWeight=2.0',
                # Higher thresholds for conditional distribution
                'Borders:TargetBorderCount=3:PriorWeight=5.0',
            ],
        )


def demonstrate_quantile_ctr():
    """
    Show how TargetBorderCount affects what CatBoost learns.
    """
    np.random.seed(42)
    n_samples = 5000

    # Create data with heterogeneous category effects
    categories = np.random.choice(['A', 'B', 'C', 'D'], n_samples)
    targets = np.zeros(n_samples)
    for i, cat in enumerate(categories):
        if cat == 'A':
            targets[i] = np.random.normal(100, 10)   # High mean, low variance
        elif cat == 'B':
            targets[i] = np.random.normal(50, 30)    # Low mean, high variance
        elif cat == 'C':
            targets[i] = np.random.exponential(30)   # Skewed
        else:
            targets[i] = np.random.normal(70, 20)    # Medium

    df = pd.DataFrame({'category': categories, 'target': targets})

    # Compare different TargetBorderCount settings
    results = {}
    for n_borders in [1, 3, 5]:
        model = CatBoostRegressor(
            iterations=200,
            learning_rate=0.1,
            depth=4,
            cat_features=['category'],
            simple_ctr=[f'Borders:TargetBorderCount={n_borders}:PriorWeight=3.0'],
            random_seed=42,
            verbose=0,
        )

        # Simple train/test split
        train_df = df.iloc[:4000]
        test_df = df.iloc[4000:]

        train_pool = Pool(
            train_df[['category']], train_df['target'],
            cat_features=['category']
        )
        test_pool = Pool(
            test_df[['category']], test_df['target'],
            cat_features=['category']
        )

        model.fit(train_pool)
        preds = model.predict(test_df[['category']])

        rmse = np.sqrt(mean_squared_error(test_df['target'], preds))
        mae = mean_absolute_error(test_df['target'], preds)
        results[n_borders] = {'rmse': rmse, 'mae': mae}
        print(f"TargetBorderCount={n_borders}: RMSE={rmse:.2f}, MAE={mae:.2f}")

    return results


print("Demonstrating quantile-based CTR for heterogeneous data:")
print("-" * 60)
demonstrate_quantile_ctr()
```

The choice and number of permutations affects both the statistical properties and computational cost of target statistics.
Why Multiple Permutations?
A single permutation introduces position-dependent behavior: samples early in the ordering have few preceding observations, so their encodings are noisy and prior-dominated, while samples late in the ordering get stable estimates.
Averaging statistics over multiple permutations smooths out this positional variance, as the sketch below illustrates.
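A minimal illustration, reusing the hypothetical `ordered_target_statistic` helper and the `cats`/`y` arrays from the earlier sketch:

```python
import numpy as np

# Average the ordered statistic over several random permutations; each
# sample's encoding becomes less dependent on any single ordering.
ts_runs = np.stack([
    ordered_target_statistic(cats, y, seed=s) for s in range(4)  # 4 permutations
])
ts_avg = ts_runs.mean(axis=0)
print(ts_runs.std(axis=0))  # per-sample spread across permutations
```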
CatBoost's Permutation Strategy
CatBoost maintains several random permutations during training and computes permutation-specific target statistics, which reduces the final model's dependence on any single ordering.
Default: 4 permutations. Configurable via permutation_count parameter.
| Permutations | Variance | Training Time | Memory | Recommended Use |
|---|---|---|---|---|
| 1 | High (position-dependent) | Fastest | Lowest | Quick experiments only |
| 4 (default) | Moderate | Moderate | Moderate | Production default |
| 8 | Low | ~2x baseline | ~2x baseline | High-stakes, small data |
| 16 | Very low | ~4x baseline | ~4x baseline | Research, maximum quality |
Seed and Reproducibility
For reproducible target statistics:
```python
model = CatBoostClassifier(
    random_seed=42,  # Seeds all randomness including permutations
)
```
With the same seed, the permutations, and therefore the target statistics and the trained model, are reproducible across runs (given identical data and settings).
Time-Aware Ordering
For time-series or temporal data, consider ordered boosting modes:
```python
model = CatBoostClassifier(
    has_time=True,            # Treat the given row order (or the Pool's
                              # Timestamp) as chronological order
    boosting_type='Ordered',
)
```
With has_time=True, samples are ordered chronologically for target statistics, preventing future data from leaking into past predictions.
For time-series data, random permutations allow future targets to influence past encodings. Set has_time=True to enforce chronological ordering. This is critical for forecasting applications where temporal integrity must be preserved.
Target statistics are the mathematical engine that powers CatBoost's categorical feature handling. Understanding their mechanics enables practitioners to tune models optimally for their specific data characteristics.
Next, we'll explore CatBoost's symmetric (oblivious) decision trees—the structural innovation that makes ordered boosting computationally efficient and enables extremely fast inference.