How often a category appears in your dataset is itself a powerful signal. Rare categories behave differently from common ones—a customer from an unusual region, a product in a niche category, or a user with an uncommon device type often exhibits distinct patterns. Frequency encoding transforms categorical features by leveraging this observation, replacing each category with its occurrence count or proportion.
Unlike target encoding, frequency encoding uses no target information, making it completely safe from target leakage. This simplicity makes it a robust, universally applicable technique that should be part of every practitioner's toolkit.
By the end of this page, you will understand frequency encoding variants (count vs. normalized), when frequency provides predictive signal, strategies for handling rare categories, and how to combine frequency encoding with other categorical encoding techniques.
Basic Concept:
For a categorical feature $C$ with categories $\{c_1, c_2, \ldots, c_k\}$, frequency encoding computes the occurrence count or proportion for each category:
Count Encoding: $$\text{FE}_{\text{count}}(c_i) = n_i = |\{j : C_j = c_i\}|$$
Proportion Encoding: $$\text{FE}_{\text{prop}}(c_i) = \frac{n_i}{N}$$
where $N$ is the total number of samples.
Example (N = 20,000 samples):
| Category | Count | Proportion |
|---|---|---|
| Electronics | 5,000 | 0.250 |
| Clothing | 8,000 | 0.400 |
| Home & Garden | 4,000 | 0.200 |
| Sports | 2,000 | 0.100 |
| Toys | 1,000 | 0.050 |
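Both variants are one-liners in pandas. A minimal sketch that produces the two columns above (the tiny stand-in dataset and column name `category` are illustrative):

```python
import pandas as pd

# Small stand-in dataset; in practice this would be the full training set
df = pd.DataFrame({'category': ['Electronics', 'Clothing', 'Clothing',
                                'Toys', 'Electronics', 'Clothing']})

counts = df['category'].value_counts()

# Count encoding: each row gets its category's occurrence count
df['category_count'] = df['category'].map(counts)

# Proportion encoding: normalize counts by the number of rows
df['category_prop'] = df['category'].map(counts / len(df))
print(df)
```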
Why Frequency Carries Signal:
Frequency encoding captures several implicit patterns: how popular or established an entity is, how much evidence the training data holds for each category, and whether an observation is an edge case. The high-value scenarios later on this page make each of these concrete.
Key Advantages:
- No target leakage: the encoding depends only on the feature's own distribution.
- A single numeric column regardless of cardinality, so it scales to thousands of categories.
- Unseen categories at inference time can be mapped to a sensible fallback value.
- Cheap to compute, easy to explain, and simple to reproduce in production.
A production-oriented implementation as a scikit-learn transformer:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Production-ready frequency encoder with multiple encoding variants.
    """

    def __init__(self, encoding_type='proportion', handle_unknown='global_mean',
                 min_frequency=None, rare_category_value=None):
        """
        Parameters:
        -----------
        encoding_type : str
            'count', 'proportion', 'log_count', 'rank'
        handle_unknown : str
            How to handle unseen categories: 'global_mean', 'zero', 'min'
        min_frequency : int or None
            Categories below this count are grouped as 'rare'
        rare_category_value : float or None
            Override value for rare categories
        """
        self.encoding_type = encoding_type
        self.handle_unknown = handle_unknown
        self.min_frequency = min_frequency
        self.rare_category_value = rare_category_value
        self.encoding_maps_ = {}
        self.global_values_ = {}

    def fit(self, X, y=None):
        """Compute frequency statistics from training data."""
        if isinstance(X, pd.DataFrame):
            for col in X.columns:
                self._fit_column(X[col], col)
        else:
            self._fit_column(pd.Series(X.ravel()), 'feature')
        return self

    def _fit_column(self, series, col_name):
        """Fit a single column."""
        counts = series.value_counts()
        n_total = len(series)

        # Group rare categories if a threshold is specified:
        # each rare category shares the pooled count of the 'rare' group
        rare_mask = pd.Series(False, index=counts.index)
        if self.min_frequency is not None:
            rare_mask = counts < self.min_frequency
            counts = counts.mask(rare_mask, counts[rare_mask].sum())

        if self.encoding_type == 'count':
            encoding = counts.to_dict()
            global_val = counts.mean()
        elif self.encoding_type == 'proportion':
            encoding = (counts / n_total).to_dict()
            global_val = 1.0 / len(counts)  # Uniform assumption
        elif self.encoding_type == 'log_count':
            encoding = np.log1p(counts).to_dict()
            global_val = np.log1p(counts.mean())
        elif self.encoding_type == 'rank':
            # Rank by frequency (1 = most common)
            encoding = counts.rank(ascending=False, method='dense').to_dict()
            global_val = len(counts) / 2  # Middle rank
        else:
            raise ValueError(f"Unknown encoding_type: {self.encoding_type}")

        # Optional explicit override value for rare categories
        if self.rare_category_value is not None:
            for cat in rare_mask[rare_mask].index:
                encoding[cat] = self.rare_category_value

        self.encoding_maps_[col_name] = encoding
        self.global_values_[col_name] = global_val

    def transform(self, X):
        """Apply frequency encoding."""
        if isinstance(X, pd.DataFrame):
            X_encoded = X.copy()
            for col in X.columns:
                if col in self.encoding_maps_:
                    X_encoded[f'{col}_freq'] = self._transform_column(X[col], col)
            return X_encoded
        else:
            return self._transform_column(pd.Series(X.ravel()), 'feature').values

    def _transform_column(self, series, col_name):
        """Transform a single column."""
        encoded = series.map(self.encoding_maps_[col_name])

        # Handle unknown categories
        if self.handle_unknown == 'global_mean':
            fill_value = self.global_values_[col_name]
        elif self.handle_unknown == 'zero':
            fill_value = 0
        elif self.handle_unknown == 'min':
            fill_value = min(self.encoding_maps_[col_name].values())
        else:
            fill_value = self.global_values_[col_name]

        return encoded.fillna(fill_value)


# Example usage
if __name__ == "__main__":
    df = pd.DataFrame({
        'category': ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D', 'D'],
        'target': [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    })

    encoder = FrequencyEncoder(encoding_type='proportion')
    df_encoded = encoder.fit_transform(df[['category']])
    print(df_encoded)
```

Encoding Variants:
| Variant | Formula | Use Case | Scale |
|---|---|---|---|
| Count | n_i | When absolute frequency matters | 0 to N |
| Proportion | n_i / N | Normalized, dataset-size independent | 0 to 1 |
| Log-Count | log(1 + n_i) | When frequencies span orders of magnitude | 0 to log(N) |
| Rank | rank(n_i) | Ordinal encoding by popularity | 1 to k |
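All four variants fall out of a single `value_counts()` call. A minimal sketch (the series `s` is an illustrative stand-in):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
counts = s.value_counts()

variants = pd.DataFrame({
    'count': counts,                    # absolute frequency
    'proportion': counts / len(s),      # dataset-size independent
    'log_count': np.log1p(counts),      # compresses orders of magnitude
    'rank': counts.rank(ascending=False, method='dense').astype(int),
})
print(variants)
```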
Rare categories present unique challenges. A category appearing only once or twice in training may not appear in test data (or vice versa), leading to instability.
Strategies for Rare Categories:
1. Frequency Threshold Grouping: Group all categories below a minimum frequency into a single "rare" category:
```python
min_freq = 10
counts = df['category'].value_counts()
rare_categories = counts[counts < min_freq].index
df['category_grouped'] = df['category'].replace(rare_categories, 'RARE')
```
2. Percentile-Based Grouping: Group categories in the bottom X% of frequency:
```python
counts = df['category'].value_counts()
threshold = counts.quantile(0.10)  # counts in the bottom 10% of frequencies
rare_categories = counts[counts < threshold].index
```
3. Cumulative Coverage: Keep categories covering the top 95% of samples, group the rest:
```python
cum_share = df['category'].value_counts(normalize=True).cumsum()
keep = cum_share[cum_share <= 0.95].index
df['category_grouped'] = df['category'].where(df['category'].isin(keep), 'RARE')
```
Sometimes 'being rare' is itself the signal! Consider creating a binary 'is_rare' feature alongside frequency encoding. This captures the distinct behavior of edge cases without losing granularity for common categories.
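A minimal sketch of this pairing (the column name `category` and the threshold of 5 are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({'category': ['A'] * 8 + ['B', 'C']})

# Frequency feature plus a binary edge-case indicator
freq = df['category'].map(df['category'].value_counts())
df['category_freq'] = freq
df['category_is_rare'] = (freq < 5).astype(int)  # threshold is an assumption
```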
Frequency encoding provides strong signal in specific scenarios:
High-Value Scenarios:
| Scenario | Why Frequency Helps | Example |
|---|---|---|
| Popularity matters | Common items behave differently | Product recommendations |
| Trust/reliability | Established entities more reliable | Seller ratings |
| Statistical stability | Model can trust frequent categories more | Any high-cardinality feature |
| Zipf-distributed data | Captures natural power-law patterns | Word frequencies, city populations |
| Novelty detection | Rare = potentially anomalous | Fraud detection |
Low-Value Scenarios:
Frequency encoding adds little when categories are near-uniformly distributed (the encoded column is almost constant), when popularity has no plausible relationship to the target, or when distinct categories happen to share the same count and therefore collapse to the same encoded value.
The most effective categorical feature strategies combine multiple encoding approaches:
Recommended Encoding Stack:
```
For high-cardinality categorical feature 'category':
├── category_target_encoded (predictive power)
├── category_frequency (popularity signal)
├── category_is_rare (edge case indicator)
└── [optional] category_one_hot (for top-N if interpretability needed)
```
Why Combinations Work:
Each encoding captures different information: the target encoding contributes direct predictive power, the frequency feature contributes the popularity signal, and the rare indicator flags edge cases.
Tree-based models can leverage all three simultaneously, using different splits for different purposes.
A helper that builds the full stack in one pass (the target-encoding step is left as a stub, since its K-fold implementation was covered on the previous page):

```python
import numpy as np
import pandas as pd


def create_comprehensive_categorical_features(df, cat_col, target_col,
                                              min_freq=10, top_n_onehot=5):
    """
    Create a comprehensive set of categorical encodings for boosting.
    """
    result = df.copy()

    # 1. Frequency encoding (raw and log-scaled counts)
    freq = df[cat_col].value_counts()
    result[f'{cat_col}_freq'] = df[cat_col].map(freq)
    result[f'{cat_col}_freq_log'] = np.log1p(result[f'{cat_col}_freq'])

    # 2. Rare indicator
    result[f'{cat_col}_is_rare'] = (result[f'{cat_col}_freq'] < min_freq).astype(int)

    # 3. Target encoding (K-fold for training)
    # [Implementation from previous page]

    # 4. One-hot for top categories only
    top_categories = freq.nlargest(top_n_onehot).index
    for cat in top_categories:
        result[f'{cat_col}_is_{cat}'] = (df[cat_col] == cat).astype(int)

    return result
```

You now understand frequency encoding, a simple yet powerful technique for categorical features. Next, we explore broader categorical handling strategies, including ordinal encoding, hashing, and choosing between encoding methods.