Target encoding (also called mean encoding or likelihood encoding) is one of the most powerful techniques for high-cardinality categorical features. Instead of creating sparse indicator vectors, it encodes each category with the mean of the target variable for that category.
For a binary classification where target ∈ {0, 1}:
| Category | Samples | Target Mean | Encoded Value |
|---|---|---|---|
| 'category_A' | 500 | 0.72 | 0.72 |
| 'category_B' | 1200 | 0.31 | 0.31 |
| 'category_C' | 89 | 0.85 | 0.85 |
This elegantly compresses any cardinality to a single informative column. A feature with 1 million categories becomes 1 column instead of 1 million.
Naive target encoding creates severe data leakage—each row's encoding incorporates information from that same row's target. This inflates training metrics dramatically while failing in production. Proper regularization and cross-validation schemes are mandatory.
Basic Target Encoding Formula:
For category c in feature X, the naive target encoding is:
$$TE(c) = \frac{\sum_{i: X_i = c} y_i}{n_c} = \bar{y}_c$$
Where:
- $n_c$ is the number of training samples with $X_i = c$
- $\bar{y}_c$ is the mean of the target over those samples
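In code, the naive encoding is just a grouped mean of the target (a minimal sketch with toy data; column names are illustrative):

```python
import pandas as pd

# Toy data: three categories with different target rates
df = pd.DataFrame({
    'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
    'y':   [1,   1,   0,   0,   1,   1],
})

# Naive target encoding: per-category mean of the target
te = df.groupby('cat')['y'].mean()
df['cat_te'] = df['cat'].map(te)
print(df)
```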
The Variance Problem:
Rare categories pose a critical challenge. If category rare_cat appears only twice with targets [1, 0], its encoding is 0.5—but this estimate has enormous variance. A single different sample could swing it to 0.0 or 1.0.
Regularized Target Encoding (Smoothing):
To handle rare categories, we blend the category mean with the global mean:
$$TE_{smooth}(c) = \frac{n_c \cdot \bar{y}_c + m \cdot \bar{y}_{global}}{n_c + m}$$
Where $m$ is the smoothing parameter (regularization strength). When $n_c \gg m$, the encoding approaches the category mean. When $n_c \ll m$, it approaches the global mean.
| Category | Count (n_c) | Category Mean | Global Mean=0.4 | m=1 | m=10 | m=100 |
|---|---|---|---|---|---|---|
| common | 1000 | 0.72 | 0.4 | 0.720 | 0.717 | 0.691 |
| medium | 100 | 0.65 | 0.4 | 0.648 | 0.627 | 0.525 |
| rare | 10 | 0.80 | 0.4 | 0.764 | 0.600 | 0.436 |
| very_rare | 2 | 1.00 | 0.4 | 0.800 | 0.500 | 0.412 |
Common heuristics: m=10-100 for most problems. Cross-validate to find optimal m. Higher m means more regularization—better for very high cardinality or small datasets. Some implementations use m = variance(y) / variance(category_means) as an adaptive choice.
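As a sanity check, the smoothing formula reproduces the table rows above (a minimal sketch; the function name is illustrative):

```python
def smooth_te(n_c, cat_mean, global_mean, m):
    """Blend the category mean with the global mean (regularized TE)."""
    return (n_c * cat_mean + m * global_mean) / (n_c + m)

# Global mean = 0.4, as in the table
print(smooth_te(1000, 0.72, 0.4, m=100))  # common category, heavy smoothing
print(smooth_te(10, 0.80, 0.4, m=10))     # rare category
print(smooth_te(2, 1.00, 0.4, m=1))       # very rare category
```

Note how the rare category with m=10 lands exactly halfway between its own mean and the global mean, because its count equals the smoothing strength.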
Why Naive Target Encoding Leaks:
When computing target encoding on training data, each sample's encoding uses statistics that include its own target. For rare categories, this is catastrophic—a category with one sample gets encoded as exactly its target value, providing a perfect (but useless) predictor.
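A tiny illustration of the leak: if every category appears once, the naive encoding reproduces the target exactly (toy data, for illustration only):

```python
import pandas as pd

# Every category appears exactly once, so each per-category
# mean is just that row's own target value
df = pd.DataFrame({'cat': ['a', 'b', 'c', 'd'],
                   'y':   [1,   0,   1,   0]})
df['cat_te'] = df['cat'].map(df.groupby('cat')['y'].mean())

# A "perfect" but useless predictor on the training set
print((df['cat_te'] == df['y']).all())  # prints True
```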
Leave-One-Out (LOO) Encoding:
Exclude the current sample when computing the category statistic:
$$TE_{LOO}(x_i) = \frac{\sum_{j \neq i,\; X_j = X_i} y_j}{n_{X_i} - 1}$$
This prevents direct leakage but still allows indirect leakage through correlated samples.
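The LOO formula can be computed vectorized from grouped sums and counts (a sketch; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b'],
                   'y':   [1,   0,   1,   0,   1]})

grp = df.groupby('cat')['y']
# Exclude each row's own target: (group sum - y_i) / (group count - 1)
df['cat_loo'] = (grp.transform('sum') - df['y']) / (grp.transform('count') - 1)
print(df)
```

Note that categories with a single sample would divide by zero here; a real implementation needs a fallback such as the global mean.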
K-Fold Target Encoding (Recommended):
The gold standard approach mirrors cross-validation:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

class KFoldTargetEncoder:
    """K-Fold target encoding with smoothing to prevent leakage."""

    def __init__(self, cols, n_folds=5, smoothing=10):
        self.cols = cols
        self.n_folds = n_folds
        self.smoothing = smoothing
        self.global_mean_ = None
        self.encoding_maps_ = {}

    def fit_transform(self, X, y):
        """Fit on training data using K-fold scheme."""
        X = X.copy()
        y = np.array(y)
        self.global_mean_ = y.mean()
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for col in self.cols:
            X[f'{col}_te'] = np.nan
            for train_idx, val_idx in kf.split(X):
                # Compute stats from training fold
                train_df = pd.DataFrame({col: X[col].iloc[train_idx],
                                         'y': y[train_idx]})
                stats = train_df.groupby(col)['y'].agg(['mean', 'count'])
                # Apply smoothing
                smooth_mean = (stats['count'] * stats['mean']
                               + self.smoothing * self.global_mean_) \
                              / (stats['count'] + self.smoothing)
                # Apply to validation fold
                X.loc[X.index[val_idx], f'{col}_te'] = \
                    X[col].iloc[val_idx].map(smooth_mean)

            # Fill categories unseen in a training fold with the global mean
            X[f'{col}_te'] = X[f'{col}_te'].fillna(self.global_mean_)

            # Store full mapping for test data
            full_stats = pd.DataFrame({col: X[col], 'y': y}) \
                .groupby(col)['y'].agg(['mean', 'count'])
            self.encoding_maps_[col] = (full_stats['count'] * full_stats['mean']
                                        + self.smoothing * self.global_mean_) \
                                       / (full_stats['count'] + self.smoothing)

        return X[[f'{c}_te' for c in self.cols]]

    def transform(self, X):
        """Transform test data using full training statistics."""
        X = X.copy()
        result = pd.DataFrame(index=X.index)
        for col in self.cols:
            result[f'{col}_te'] = X[col].map(self.encoding_maps_[col]) \
                .fillna(self.global_mean_)
        return result

# Example usage
np.random.seed(42)
df = pd.DataFrame({
    'cat': np.random.choice(['A', 'B', 'C', 'D', 'E'], 1000,
                            p=[0.4, 0.3, 0.15, 0.1, 0.05]),
    'target': np.random.binomial(1, 0.3, 1000)
})

encoder = KFoldTargetEncoder(cols=['cat'], n_folds=5, smoothing=20)
encoded = encoder.fit_transform(df, df['target'])
print(encoded.head(10))
```

Weight of Evidence originated in credit scoring and is mathematically related to target encoding. For binary classification, WoE measures how much a category's presence shifts the log-odds of the positive class:
$$WoE(c) = \ln\left(\frac{P(X=c | Y=1)}{P(X=c | Y=0)}\right) = \ln\left(\frac{\text{Distribution of Goods}}{\text{Distribution of Bads}}\right)$$
Interpretation:
- WoE > 0: the category is over-represented among positives (Y=1)
- WoE < 0: the category is over-represented among negatives (Y=0)
- WoE = 0: the category carries no information about the target
Relationship to Target Encoding:
WoE and target encoding are monotonically related for binary targets. WoE has additional properties valuable for credit scoring:
- It is expressed on the log-odds scale, so it enters logistic-regression scorecards as an additive term.
- It is centered: categories with no predictive information get WoE ≈ 0.
- It pairs naturally with Information Value (IV), a standard single-number summary of a feature's predictive strength.
```python
import pandas as pd
import numpy as np

def calculate_woe(df, cat_col, target_col, min_pct=0.0001):
    """Calculate Weight of Evidence for a categorical column."""
    # Count positives and negatives per category
    grouped = df.groupby(cat_col)[target_col].agg(['sum', 'count'])
    grouped.columns = ['positives', 'total']
    grouped['negatives'] = grouped['total'] - grouped['positives']

    # Distributions of positives and negatives across categories
    total_pos = grouped['positives'].sum()
    total_neg = grouped['negatives'].sum()
    grouped['dist_pos'] = grouped['positives'] / total_pos
    grouped['dist_neg'] = grouped['negatives'] / total_neg

    # Avoid division by zero / log(0)
    grouped['dist_pos'] = grouped['dist_pos'].clip(lower=min_pct)
    grouped['dist_neg'] = grouped['dist_neg'].clip(lower=min_pct)

    # WoE = ln(dist_pos / dist_neg)
    grouped['woe'] = np.log(grouped['dist_pos'] / grouped['dist_neg'])

    # Information Value (IV) = sum of (dist_pos - dist_neg) * WoE
    grouped['iv_component'] = (grouped['dist_pos'] - grouped['dist_neg']) * grouped['woe']
    return grouped[['woe', 'iv_component']], grouped['iv_component'].sum()

# Example: default rate depends on the risk category
np.random.seed(42)
risk = np.random.choice([0, 1, 2], 1000, p=[0.5, 0.35, 0.15])
df = pd.DataFrame({
    'category': np.array(['low_risk', 'medium_risk', 'high_risk'])[risk],
    'default': np.random.binomial(1, np.array([0.05, 0.15, 0.40])[risk])
})

woe_table, iv = calculate_woe(df, 'category', 'default')
print(f"Total Information Value: {iv:.4f}")
print(woe_table)
```

IV summarizes the predictive power of a categorical feature: IV < 0.02 (not useful), 0.02-0.1 (weak), 0.1-0.3 (medium), 0.3-0.5 (strong), > 0.5 (suspiciously high—check for leakage).
Target encoding extends naturally to regression problems. Instead of class proportions, we use the mean (or median) of the continuous target:
$$TE_{reg}(c) = \frac{\sum_{i: X_i = c} y_i}{n_c}$$
Additional Statistics for Regression:
Beyond the mean, you can encode additional statistics per category:
- Standard deviation (how much the target varies within the category)
- Median (robust to outliers in the target)
- Count or frequency (often a useful feature in its own right)
Example: House Price Prediction
For a neighborhood feature:
| Neighborhood | Mean Price | Median | Std Dev | Count |
|---|---|---|---|---|
| 'downtown' | $850,000 | $780,000 | $120,000 | 500 |
| 'suburbs' | $420,000 | $400,000 | $80,000 | 1200 |
| 'rural' | $280,000 | $260,000 | $60,000 | 300 |
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def multi_stat_target_encode(X, y, cat_cols, stats=['mean', 'std'],
                             n_folds=5, smoothing=10):
    """Encode with multiple target statistics (smoothing not applied here)."""
    X = X.copy()
    y = np.array(y)
    global_stats = {
        'mean': y.mean(),
        'std': y.std(),
        'median': np.median(y)
    }
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    result_cols = []

    for col in cat_cols:
        for stat in stats:
            new_col = f'{col}_{stat}'
            X[new_col] = np.nan
            result_cols.append(new_col)

            for train_idx, val_idx in kf.split(X):
                train_df = pd.DataFrame({col: X[col].iloc[train_idx],
                                         'y': y[train_idx]})
                if stat == 'mean':
                    agg = train_df.groupby(col)['y'].mean()
                elif stat == 'std':
                    agg = train_df.groupby(col)['y'].std().fillna(0)
                elif stat == 'median':
                    agg = train_df.groupby(col)['y'].median()
                X.loc[X.index[val_idx], new_col] = X[col].iloc[val_idx].map(agg)

            # Fall back to the global statistic for unseen categories
            X[new_col] = X[new_col].fillna(global_stats.get(stat, 0))

    return X[result_cols]

# Example: house prices
np.random.seed(42)
neighborhoods = ['downtown', 'suburbs', 'rural']
base_prices = {'downtown': 800000, 'suburbs': 400000, 'rural': 250000}

df = pd.DataFrame({
    'neighborhood': np.random.choice(neighborhoods, 500, p=[0.3, 0.5, 0.2]),
})
df['price'] = df['neighborhood'].map(base_prices) + np.random.randn(500) * 50000

encoded = multi_stat_target_encode(df, df['price'], ['neighborhood'],
                                   stats=['mean', 'std'])
print(encoded.head())
```

The category_encoders library provides production-ready implementations: ce.TargetEncoder(smoothing=1.0) for basic target encoding, ce.LeaveOneOutEncoder() for LOO encoding, and ce.WOEEncoder() for Weight of Evidence.
The next page covers Embedding Layers—learnable dense representations for categorical features in neural networks. Embeddings go beyond target encoding by learning task-specific representations that capture complex category relationships.