Target encoding (also called mean encoding or likelihood encoding) is one of the most powerful techniques for high-cardinality categorical features. Instead of creating sparse indicator vectors, it encodes each category with the mean of the target variable for that category.
For a binary classification where target ∈ {0, 1}:
| Category | Samples | Target Mean | Encoded Value |
|---|---|---|---|
| 'category_A' | 500 | 0.72 | 0.72 |
| 'category_B' | 1200 | 0.31 | 0.31 |
| 'category_C' | 89 | 0.85 | 0.85 |
This elegantly compresses any cardinality to a single informative column. A feature with 1 million categories becomes 1 column instead of 1 million.
Naive target encoding creates severe data leakage—each row's encoding incorporates information from that same row's target. This inflates training metrics dramatically while failing in production. Proper regularization and cross-validation schemes are mandatory.
Basic Target Encoding Formula:
For category c in feature X, the naive target encoding is:
$$TE(c) = \frac{\sum_{i: X_i = c} y_i}{n_c} = \bar{y}_c$$
Where:
- $n_c$ is the number of training samples with $X_i = c$
- $\bar{y}_c$ is the mean of the target over those samples
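In code, the naive encoding is just a grouped mean of the target (a minimal sketch with toy data; column names are illustrative):

```python
import pandas as pd

# Toy data: three categories with different target rates
df = pd.DataFrame({
    'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
    'y':   [1,   1,   0,   0,   1,   1],
})

# Naive target encoding: per-category mean of the target
te = df.groupby('cat')['y'].mean()
df['cat_te'] = df['cat'].map(te)
print(df)
```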
The Variance Problem:
Rare categories pose a critical challenge. If category rare_cat appears only twice with targets [1, 0], its encoding is 0.5—but this estimate has enormous variance. A single different sample could swing it to 0.0 or 1.0.
Regularized Target Encoding (Smoothing):
To handle rare categories, we blend the category mean with the global mean:
$$TE_{smooth}(c) = \frac{n_c \cdot \bar{y}_c + m \cdot \bar{y}_{global}}{n_c + m}$$
Where $m$ is the smoothing parameter (regularization strength). When $n_c \gg m$, the encoding approaches the category mean. When $n_c \ll m$, it approaches the global mean.
| Category | Count (n_c) | Category Mean | Global Mean=0.4 | m=1 | m=10 | m=100 |
|---|---|---|---|---|---|---|
| common | 1000 | 0.72 | 0.4 | 0.720 | 0.717 | 0.691 |
| medium | 100 | 0.65 | 0.4 | 0.648 | 0.627 | 0.525 |
| rare | 10 | 0.80 | 0.4 | 0.764 | 0.600 | 0.436 |
| very_rare | 2 | 1.00 | 0.4 | 0.800 | 0.500 | 0.412 |
Common heuristics: m=10-100 for most problems. Cross-validate to find optimal m. Higher m means more regularization—better for very high cardinality or small datasets. Some implementations use m = variance(y) / variance(category_means) as an adaptive choice.
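As a sanity check, the smoothing formula reproduces the table rows above (a minimal sketch; the function name is illustrative):

```python
def smooth_te(n_c, cat_mean, global_mean, m):
    """Blend the category mean with the global mean (regularized TE)."""
    return (n_c * cat_mean + m * global_mean) / (n_c + m)

# Global mean = 0.4, as in the table
print(smooth_te(1000, 0.72, 0.4, m=100))  # common category, heavy smoothing
print(smooth_te(10, 0.80, 0.4, m=10))     # rare category
print(smooth_te(2, 1.00, 0.4, m=1))       # very rare category
```

Note how the rare category with m=10 lands exactly halfway between its own mean and the global mean, because its count equals the smoothing strength.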
Why Naive Target Encoding Leaks:
When computing target encoding on training data, each sample's encoding uses statistics that include its own target. For rare categories, this is catastrophic—a category with one sample gets encoded as exactly its target value, providing a perfect (but useless) predictor.
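A tiny illustration of the leak: if every category appears once, the naive encoding reproduces the target exactly (toy data, for illustration only):

```python
import pandas as pd

# Every category appears exactly once, so each per-category
# mean is just that row's own target value
df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd'],
                   'y':   [1,   0,   1,   0]})
df['cat_te'] = df['cat'].map(df.groupby('cat')['y'].mean())

# A "perfect" but useless predictor on the training set
print((df['cat_te'] == df['y']).all())  # prints True
```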
Leave-One-Out (LOO) Encoding:
Exclude the current sample when computing the category statistic:
$$TE_{LOO}(x_i) = \frac{\sum_{j \neq i,\; X_j = X_i} y_j}{n_{X_i} - 1}$$
This prevents direct leakage but still allows indirect leakage through correlated samples.
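The LOO formula can be computed vectorized from grouped sums and counts (a sketch; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b'],
                   'y':   [1,   0,   1,   0,   1]})

grp = df.groupby('cat')['y']
# Exclude each row's own target: (group sum - y_i) / (group count - 1)
df['cat_loo'] = (grp.transform('sum') - df['y']) / (grp.transform('count') - 1)
print(df)
```

Note that categories with a single sample would divide by zero here; a real implementation needs a fallback such as the global mean.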
K-Fold Target Encoding (Recommended):
The gold standard approach mirrors cross-validation:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

class KFoldTargetEncoder:
    """K-Fold target encoding with smoothing to prevent leakage."""

    def __init__(self, cols, n_folds=5, smoothing=10):
        self.cols = cols
        self.n_folds = n_folds
        self.smoothing = smoothing
        self.global_mean_ = None
        self.encoding_maps_ = {}

    def fit_transform(self, X, y):
        """Fit on training data using K-fold scheme."""
        X = X.copy()
        y = np.array(y)
        self.global_mean_ = y.mean()
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for col in self.cols:
            X[f'{col}_te'] = np.nan
            for train_idx, val_idx in kf.split(X):
                # Compute stats from training fold
                train_df = pd.DataFrame({col: X[col].iloc[train_idx],
                                         'y': y[train_idx]})
                stats = train_df.groupby(col)['y'].agg(['mean', 'count'])
                # Apply smoothing
                smooth_mean = (stats['count'] * stats['mean']
                               + self.smoothing * self.global_mean_) \
                              / (stats['count'] + self.smoothing)
                # Apply to validation fold
                X.loc[X.index[val_idx], f'{col}_te'] = \
                    X[col].iloc[val_idx].map(smooth_mean)

            # Fill categories unseen in a training fold with the global mean
            X[f'{col}_te'] = X[f'{col}_te'].fillna(self.global_mean_)

            # Store full mapping for test data
            full_stats = pd.DataFrame({col: X[col], 'y': y}) \
                .groupby(col)['y'].agg(['mean', 'count'])
            self.encoding_maps_[col] = (full_stats['count'] * full_stats['mean']
                                        + self.smoothing * self.global_mean_) \
                                       / (full_stats['count'] + self.smoothing)

        return X[[f'{c}_te' for c in self.cols]]

    def transform(self, X):
        """Transform test data using full training statistics."""
        X = X.copy()
        result = pd.DataFrame(index=X.index)
        for col in self.cols:
            result[f'{col}_te'] = X[col].map(self.encoding_maps_[col]) \
                .fillna(self.global_mean_)
        return result

# Example usage
np.random.seed(42)
df = pd.DataFrame({
    'cat': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000,
                            p=[0.4, 0.3, 0.15, 0.1, 0.05]),
    'target': np.random.binomial(1, 0.3, 1000)
})

encoder = KFoldTargetEncoder(cols=['cat'], n_folds=5, smoothing=20)
encoded = encoder.fit_transform(df, df['target'])
print(encoded.head(10))
```

Weight of Evidence originated in credit scoring and is mathematically related to target encoding. For binary classification, WoE measures how much a category's presence shifts the log-odds of the positive class:
$$WoE(c) = \ln\left(\frac{P(X=c | Y=1)}{P(X=c | Y=0)}\right) = \ln\left(\frac{\text{Distribution of Goods}}{\text{Distribution of Bads}}\right)$$
Interpretation:
- WoE > 0: the category is over-represented among positives (Y=1)
- WoE < 0: the category is over-represented among negatives (Y=0)
- WoE = 0: the category carries no information about the target
Relationship to Target Encoding:
WoE and target encoding are monotonically related for binary targets. WoE has additional properties valuable for credit scoring:
- It is expressed on the log-odds scale, so it enters logistic-regression scorecards as an additive term.
- It is centered: categories with no predictive information get WoE ≈ 0.
- It pairs naturally with Information Value (IV), a standard single-number summary of a feature's predictive strength.
```python
import pandas as pd
import numpy as np

def calculate_woe(df, cat_col, target_col, min_pct=0.0001):
    """Calculate Weight of Evidence for a categorical column."""
    # Count positives and negatives per category
    grouped = df.groupby(cat_col)[target_col].agg(['sum', 'count'])
    grouped.columns = ['positives', 'total']
    grouped['negatives'] = grouped['total'] - grouped['positives']

    # Distributions of positives and negatives across categories
    total_pos = grouped['positives'].sum()
    total_neg = grouped['negatives'].sum()
    grouped['dist_pos'] = grouped['positives'] / total_pos
    grouped['dist_neg'] = grouped['negatives'] / total_neg

    # Avoid division by zero / log(0)
    grouped['dist_pos'] = grouped['dist_pos'].clip(lower=min_pct)
    grouped['dist_neg'] = grouped['dist_neg'].clip(lower=min_pct)

    # WoE = ln(dist_pos / dist_neg)
    grouped['woe'] = np.log(grouped['dist_pos'] / grouped['dist_neg'])

    # Information Value (IV) = sum of (dist_pos - dist_neg) * WoE
    grouped['iv_component'] = (grouped['dist_pos'] - grouped['dist_neg']) * grouped['woe']
    return grouped[['woe', 'iv_component']], grouped['iv_component'].sum()

# Example: default rate depends on the risk category
np.random.seed(42)
risk = np.random.choice([0, 1, 2], 1000, p=[0.5, 0.35, 0.15])
df = pd.DataFrame({
    'category': np.array(['low_risk', 'medium_risk', 'high_risk'])[risk],
    'default': np.random.binomial(1, np.array([0.05, 0.15, 0.40])[risk])
})

woe_table, iv = calculate_woe(df, 'category', 'default')
print(f"Total Information Value: {iv:.4f}")
print(woe_table)
```

IV summarizes the predictive power of a categorical feature: IV < 0.02 (not useful), 0.02-0.1 (weak), 0.1-0.3 (medium), 0.3-0.5 (strong), > 0.5 (suspiciously high—check for leakage).
Target encoding extends naturally to regression problems. Instead of class proportions, we use the mean (or median) of the continuous target:
$$TE_{reg}(c) = \frac{\sum_{i: X_i = c} y_i}{n_c}$$
Additional Statistics for Regression:
Beyond the mean, you can encode additional statistics per category:
- Standard deviation (how much the target varies within the category)
- Median (robust to outliers in the target)
- Count or frequency (often a useful feature in its own right)
Example: House Price Prediction
For a neighborhood feature:
| Neighborhood | Mean Price | Median | Std Dev | Count |
|---|---|---|---|---|
| 'downtown' | $850,000 | $780,000 | $120,000 | 500 |
| 'suburbs' | $420,000 | $400,000 | $80,000 | 1200 |
| 'rural' | $280,000 | $260,000 | $60,000 | 300 |
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def multi_stat_target_encode(X, y, cat_cols, stats=['mean', 'std'],
                             n_folds=5, smoothing=10):
    """Encode with multiple target statistics (smoothing not applied here)."""
    X = X.copy()
    y = np.array(y)
    global_stats = {
        'mean': y.mean(),
        'std': y.std(),
        'median': np.median(y)
    }
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    result_cols = []

    for col in cat_cols:
        for stat in stats:
            new_col = f'{col}_{stat}'
            X[new_col] = np.nan
            result_cols.append(new_col)

            for train_idx, val_idx in kf.split(X):
                train_df = pd.DataFrame({col: X[col].iloc[train_idx],
                                         'y': y[train_idx]})
                if stat == 'mean':
                    agg = train_df.groupby(col)['y'].mean()
                elif stat == 'std':
                    agg = train_df.groupby(col)['y'].std().fillna(0)
                elif stat == 'median':
                    agg = train_df.groupby(col)['y'].median()
                X.loc[X.index[val_idx], new_col] = X[col].iloc[val_idx].map(agg)

            # Fall back to the global statistic for unseen categories
            X[new_col] = X[new_col].fillna(global_stats.get(stat, 0))

    return X[result_cols]

# Example: house prices
np.random.seed(42)
neighborhoods = ['downtown', 'suburbs', 'rural']
base_prices = {'downtown': 800000, 'suburbs': 400000, 'rural': 250000}

df = pd.DataFrame({
    'neighborhood': np.random.choice(neighborhoods, 500, p=[0.3, 0.5, 0.2]),
})
df['price'] = df['neighborhood'].map(base_prices) + np.random.randn(500) * 50000

encoded = multi_stat_target_encode(df, df['price'], ['neighborhood'],
                                   stats=['mean', 'std'])
print(encoded.head())
```

The category_encoders library provides production-ready implementations: ce.TargetEncoder(smoothing=1.0) for basic target encoding, ce.LeaveOneOutEncoder() for LOO encoding, and ce.WOEEncoder() for Weight of Evidence.
The next page covers Embedding Layers—learnable dense representations for categorical features in neural networks. Embeddings go beyond target encoding by learning task-specific representations that capture complex category relationships.