How often a category appears in your dataset is itself a powerful signal. Rare categories behave differently from common ones—a customer from an unusual region, a product in a niche category, or a user with an uncommon device type often exhibits distinct patterns. Frequency encoding transforms categorical features by leveraging this observation, replacing each category with its occurrence count or proportion.
Unlike target encoding, frequency encoding uses no target information, making it completely safe from target leakage. This simplicity makes it a robust, universally applicable technique that should be part of every practitioner's toolkit.
By the end of this page, you will understand frequency encoding variants (count vs. normalized), when frequency provides predictive signal, strategies for handling rare categories, and how to combine frequency encoding with other categorical encoding techniques.
Basic Concept:
For a categorical feature $C$ with categories $\{c_1, c_2, \ldots, c_k\}$, frequency encoding computes the occurrence count or proportion for each category:
Count Encoding: $$\text{FE}_{\text{count}}(c_i) = n_i = |\{j : C_j = c_i\}|$$
Proportion Encoding: $$\text{FE}_{\text{prop}}(c_i) = \frac{n_i}{N}$$
where $N$ is the total number of samples.
Example (N = 20,000 samples):
| Category | Count | Proportion |
|---|---|---|
| Electronics | 5,000 | 0.250 |
| Clothing | 8,000 | 0.400 |
| Home & Garden | 4,000 | 0.200 |
| Sports | 2,000 | 0.100 |
| Toys | 1,000 | 0.050 |
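Both variants are one-liners in pandas. A minimal sketch that produces the two columns above (the tiny stand-in dataset and column name `category` are illustrative):

```python
import pandas as pd

# Small stand-in dataset; in practice this would be the full training set
df = pd.DataFrame({'category': ['Electronics', 'Clothing', 'Clothing',
                                'Toys', 'Electronics', 'Clothing']})

counts = df['category'].value_counts()

# Count encoding: each row gets its category's occurrence count
df['category_count'] = df['category'].map(counts)

# Proportion encoding: normalize counts by the number of rows
df['category_prop'] = df['category'].map(counts / len(df))
print(df)
```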
Why Frequency Carries Signal:
Frequency encoding captures several implicit patterns: how popular or established an entity is, how much evidence the training data holds for each category, and whether an observation is an edge case. The high-value scenarios later on this page make each of these concrete.
Key Advantages:
- No target leakage: the encoding depends only on the feature's own distribution.
- A single numeric column regardless of cardinality, so it scales to thousands of categories.
- Unseen categories at inference time can be mapped to a sensible fallback value.
- Cheap to compute, easy to explain, and simple to reproduce in production.
A production-oriented implementation as a scikit-learn transformer:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Production-ready frequency encoder with multiple encoding variants.
    """

    def __init__(self, encoding_type='proportion', handle_unknown='global_mean',
                 min_frequency=None, rare_category_value=None):
        """
        Parameters:
        -----------
        encoding_type : str
            'count', 'proportion', 'log_count', 'rank'
        handle_unknown : str
            How to handle unseen categories: 'global_mean', 'zero', 'min'
        min_frequency : int or None
            Categories below this count are grouped as 'rare'
        rare_category_value : float or None
            Override value for rare categories
        """
        self.encoding_type = encoding_type
        self.handle_unknown = handle_unknown
        self.min_frequency = min_frequency
        self.rare_category_value = rare_category_value
        self.encoding_maps_ = {}
        self.global_values_ = {}

    def fit(self, X, y=None):
        """Compute frequency statistics from training data."""
        if isinstance(X, pd.DataFrame):
            for col in X.columns:
                self._fit_column(X[col], col)
        else:
            self._fit_column(pd.Series(X.ravel()), 'feature')
        return self

    def _fit_column(self, series, col_name):
        """Fit a single column."""
        counts = series.value_counts()
        n_total = len(series)

        # Group rare categories if a threshold is specified:
        # each rare category shares the pooled count of the 'rare' group
        rare_mask = pd.Series(False, index=counts.index)
        if self.min_frequency is not None:
            rare_mask = counts < self.min_frequency
            counts = counts.mask(rare_mask, counts[rare_mask].sum())

        if self.encoding_type == 'count':
            encoding = counts.to_dict()
            global_val = counts.mean()
        elif self.encoding_type == 'proportion':
            encoding = (counts / n_total).to_dict()
            global_val = 1.0 / len(counts)  # Uniform assumption
        elif self.encoding_type == 'log_count':
            encoding = np.log1p(counts).to_dict()
            global_val = np.log1p(counts.mean())
        elif self.encoding_type == 'rank':
            # Rank by frequency (1 = most common)
            encoding = counts.rank(ascending=False, method='dense').to_dict()
            global_val = len(counts) / 2  # Middle rank
        else:
            raise ValueError(f"Unknown encoding_type: {self.encoding_type}")

        # Optional explicit override value for rare categories
        if self.rare_category_value is not None:
            for cat in rare_mask[rare_mask].index:
                encoding[cat] = self.rare_category_value

        self.encoding_maps_[col_name] = encoding
        self.global_values_[col_name] = global_val

    def transform(self, X):
        """Apply frequency encoding."""
        if isinstance(X, pd.DataFrame):
            X_encoded = X.copy()
            for col in X.columns:
                if col in self.encoding_maps_:
                    X_encoded[f'{col}_freq'] = self._transform_column(X[col], col)
            return X_encoded
        else:
            return self._transform_column(pd.Series(X.ravel()), 'feature').values

    def _transform_column(self, series, col_name):
        """Transform a single column."""
        encoded = series.map(self.encoding_maps_[col_name])

        # Handle unknown categories
        if self.handle_unknown == 'global_mean':
            fill_value = self.global_values_[col_name]
        elif self.handle_unknown == 'zero':
            fill_value = 0
        elif self.handle_unknown == 'min':
            fill_value = min(self.encoding_maps_[col_name].values())
        else:
            fill_value = self.global_values_[col_name]

        return encoded.fillna(fill_value)


# Example usage
if __name__ == "__main__":
    df = pd.DataFrame({
        'category': ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D', 'D'],
        'target': [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    })

    encoder = FrequencyEncoder(encoding_type='proportion')
    df_encoded = encoder.fit_transform(df[['category']])
    print(df_encoded)
```

Encoding Variants:
| Variant | Formula | Use Case | Scale |
|---|---|---|---|
| Count | n_i | When absolute frequency matters | 0 to N |
| Proportion | n_i / N | Normalized, dataset-size independent | 0 to 1 |
| Log-Count | log(1 + n_i) | When frequencies span orders of magnitude | 0 to log(N) |
| Rank | rank(n_i) | Ordinal encoding by popularity | 1 to k |
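All four variants fall out of a single `value_counts()` call. A minimal sketch (the series `s` is an illustrative stand-in):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
counts = s.value_counts()

variants = pd.DataFrame({
    'count': counts,                    # absolute frequency
    'proportion': counts / len(s),      # dataset-size independent
    'log_count': np.log1p(counts),      # compresses orders of magnitude
    'rank': counts.rank(ascending=False, method='dense').astype(int),
})
print(variants)
```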
Rare categories present unique challenges. A category appearing only once or twice in training may not appear in test data (or vice versa), leading to instability.
Strategies for Rare Categories:
1. Frequency Threshold Grouping: Group all categories below a minimum frequency into a single "rare" category:
```python
min_freq = 10
counts = df['category'].value_counts()
rare_categories = counts[counts < min_freq].index
df['category_grouped'] = df['category'].replace(rare_categories, 'RARE')
```
2. Percentile-Based Grouping: Group categories in the bottom X% of frequency:
```python
counts = df['category'].value_counts()
threshold = counts.quantile(0.10)  # counts in the bottom 10% of frequencies
rare_categories = counts[counts < threshold].index
```
3. Cumulative Coverage: Keep categories covering the top 95% of samples, group the rest:
```python
cum_share = df['category'].value_counts(normalize=True).cumsum()
keep = cum_share[cum_share <= 0.95].index
df['category_grouped'] = df['category'].where(df['category'].isin(keep), 'RARE')
```
Sometimes 'being rare' is itself the signal! Consider creating a binary 'is_rare' feature alongside frequency encoding. This captures the distinct behavior of edge cases without losing granularity for common categories.
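A minimal sketch of this pairing (the column name `category` and the threshold of 5 are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({'category': ['A'] * 8 + ['B', 'C']})

# Frequency feature plus a binary edge-case indicator
freq = df['category'].map(df['category'].value_counts())
df['category_freq'] = freq
df['category_is_rare'] = (freq < 5).astype(int)  # threshold is an assumption
```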
Frequency encoding provides strong signal in specific scenarios:
High-Value Scenarios:
| Scenario | Why Frequency Helps | Example |
|---|---|---|
| Popularity matters | Common items behave differently | Product recommendations |
| Trust/reliability | Established entities more reliable | Seller ratings |
| Statistical stability | Model can trust frequent categories more | Any high-cardinality feature |
| Zipf-distributed data | Captures natural power-law patterns | Word frequencies, city populations |
| Novelty detection | Rare = potentially anomalous | Fraud detection |
Low-Value Scenarios:
Frequency encoding adds little when categories are near-uniformly distributed (the encoded column is almost constant), when popularity has no plausible relationship to the target, or when distinct categories happen to share the same count and therefore collapse to the same encoded value.
The most effective categorical feature strategies combine multiple encoding approaches:
Recommended Encoding Stack:
```
For high-cardinality categorical feature 'category':
├── category_target_encoded (predictive power)
├── category_frequency (popularity signal)
├── category_is_rare (edge case indicator)
└── [optional] category_one_hot (for top-N if interpretability needed)
```
Why Combinations Work:
Each encoding captures different information: the target encoding contributes direct predictive power, the frequency feature contributes the popularity signal, and the rare indicator flags edge cases.
Tree-based models can leverage all three simultaneously, using different splits for different purposes.
A helper that builds the full stack in one pass (the target-encoding step is left as a stub, since its K-fold implementation was covered on the previous page):

```python
import numpy as np
import pandas as pd


def create_comprehensive_categorical_features(df, cat_col, target_col,
                                              min_freq=10, top_n_onehot=5):
    """
    Create a comprehensive set of categorical encodings for boosting.
    """
    result = df.copy()

    # 1. Frequency encoding (raw and log-scaled counts)
    freq = df[cat_col].value_counts()
    result[f'{cat_col}_freq'] = df[cat_col].map(freq)
    result[f'{cat_col}_freq_log'] = np.log1p(result[f'{cat_col}_freq'])

    # 2. Rare indicator
    result[f'{cat_col}_is_rare'] = (result[f'{cat_col}_freq'] < min_freq).astype(int)

    # 3. Target encoding (K-fold for training)
    # [Implementation from previous page]

    # 4. One-hot for top categories only
    top_categories = freq.nlargest(top_n_onehot).index
    for cat in top_categories:
        result[f'{cat_col}_is_{cat}'] = (df[cat_col] == cat).astype(int)

    return result
```

You now understand frequency encoding, a simple yet powerful technique for categorical features. Next, we explore broader categorical handling strategies, including ordinal encoding, hashing, and choosing between encoding methods.