Imagine you're building a recommendation system for an e-commerce platform with 10 million unique products, a fraud detection system processing transactions from 50,000 merchant IDs, or a natural language model handling a vocabulary of 500,000 words. Each of these scenarios presents a fundamental challenge that separates production-grade machine learning from academic exercises: high-cardinality categorical features.
Categorical features—variables that take on discrete, non-numeric values—are ubiquitous in real-world data. User IDs, product SKUs, geographic regions, job titles, device types, browser fingerprints—these features often carry crucial signal for predictions. Yet they cannot be fed directly into mathematical models that expect numerical inputs.
The naive solution—one-hot encoding—works beautifully for features with a handful of categories. But when cardinality explodes into thousands, millions, or beyond, this approach collapses under its own weight. A one-hot encoded representation of 10 million products would require vectors of 10 million dimensions—computationally intractable and statistically meaningless.
This comprehensive module equips you with the full arsenal of techniques for handling high-cardinality categorical features. From foundational encoding schemes through cutting-edge embedding approaches, you'll develop the expertise to make principled decisions about encoding strategies in any production ML system.
Before diving into high-cardinality solutions, we must establish a rigorous understanding of why categorical encoding exists and what properties we desire from encoding schemes.
Why Encode at All?
Machine learning algorithms, at their core, perform mathematical operations: matrix multiplications, gradient computations, distance calculations, kernel evaluations. These operations require numerical representations. A category label like 'electronics' or 'user_4829173' has no inherent numerical meaning—we must impose structure.
The encoding we choose fundamentally shapes what the model can learn. Poor encoding decisions can inject spurious ordinal structure, explode dimensionality, discard predictive signal, or leak target information into the features.
There's no universal threshold defining 'high cardinality.' A feature with 100 categories might be high-cardinality in a dataset with 1,000 samples (sparse representation) but low-cardinality in a dataset with 100 million samples (dense representation). Always consider cardinality relative to sample size—the category-to-sample ratio.
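As a quick diagnostic, the hypothetical `cardinality_report` helper below computes each categorical column's cardinality and its category-to-sample ratio with pandas; the helper name and the toy data are illustrative, not part of any library.

```python
import pandas as pd

def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Cardinality of each object/category column, relative to sample size."""
    n = len(df)
    rows = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        k = df[col].nunique()
        rows.append({"column": col, "cardinality": k, "category_to_sample_ratio": k / n})
    return pd.DataFrame(rows).sort_values("category_to_sample_ratio", ascending=False)

# Toy example: one near-unique identifier, one genuinely low-cardinality feature
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(1000)],   # ratio 1.0 -> effectively high-cardinality
    "country": ["US", "DE", "IN", "BR"] * 250,   # ratio 0.004 -> low-cardinality
})
print(cardinality_report(df))
```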
Label encoding (also called ordinal encoding or integer encoding) assigns each unique category an integer from 0 to K-1, where K is the cardinality.
Category → Encoded Value
'apple' → 0
'banana' → 1
'cherry' → 2
'date' → 3
This is the most memory-efficient encoding possible—storing a single integer rather than a vector. For a feature with 1 million categories, label encoding requires only 20 bits per sample versus 1 million bits for one-hot encoding.
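As a sanity check on that figure, the snippet below recomputes the per-sample bit widths; it is simple arithmetic, not tied to any particular library.

```python
import math

K = 1_000_000                          # number of distinct categories
bits_label = math.ceil(math.log2(K))   # width of a single integer code
bits_onehot = K                        # one indicator bit per category
print(bits_label, bits_onehot)         # 20 vs 1,000,000 bits per sample
```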
When Label Encoding Works:
Label encoding is appropriate when the categories carry a genuine ordinal relationship (e.g., small < medium < large), or when the downstream model is tree-based—trees split on thresholds and can isolate individual integer codes, so the artificial ordering does little harm.
For linear models, SVMs, neural networks, and distance-based methods (k-NN, k-means), label encoding introduces spurious ordinal relationships. The model interprets 'cherry' (2) as 'between' 'banana' (1) and 'date' (3), and closer to 'banana' than to 'apple'. When categories have no inherent order, this injects misleading signal that degrades model performance.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Sample data
df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherry', 'apple', 'date', 'banana'],
    'size': ['small', 'medium', 'large', 'small', 'medium', 'large']
})

# Method 1: LabelEncoder (single column, learns mapping automatically)
label_encoder = LabelEncoder()
df['fruit_encoded'] = label_encoder.fit_transform(df['fruit'])

# Access the mapping
print("Label classes:", label_encoder.classes_)
# Output: ['apple' 'banana' 'cherry' 'date']

# Inverse transform
original = label_encoder.inverse_transform([0, 1, 2, 3])
print("Decoded:", original)

# Method 2: OrdinalEncoder (multiple columns, explicit category ordering)
# Useful when you want to enforce a specific order
ordinal_encoder = OrdinalEncoder(
    categories=[['small', 'medium', 'large']],  # Explicit ordering
    handle_unknown='use_encoded_value',
    unknown_value=-1  # Assign -1 to unseen categories
)
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']])

# Check the result
print(df)

# Handling unseen categories at inference time
new_data = pd.DataFrame({'size': ['tiny', 'large', 'xl']})
new_encoded = ordinal_encoder.transform(new_data)
print("With unknowns:", new_encoded)  # [-1, 2, -1]
```

Critical Implementation Details:
Consistency between training and inference — The same category must always map to the same integer. Store the encoder object or the explicit mapping dictionary (see the sketch after this list).
Handling unseen categories — At inference time, you'll encounter categories not present during training. Strategies include:
Memory of fit — LabelEncoder.fit() memorizes the categories seen. If training data evolves, you need retraining or incremental update strategies.
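A minimal sketch of those three points, assuming joblib for persistence: fit the encoder once, store it next to the model artifact, and let unseen categories fall back to a sentinel value. The file path is illustrative.

```python
import joblib
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Training time: fit once and persist the fitted encoder with the model artifact
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(pd.DataFrame({'size': ['small', 'medium', 'large']}))
joblib.dump(encoder, 'size_encoder.joblib')  # illustrative path

# Inference time: load the same object so every category keeps its training-time code
encoder = joblib.load('size_encoder.joblib')
print(encoder.transform(pd.DataFrame({'size': ['medium', 'xl']})))  # unseen 'xl' -> -1
```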
One-hot encoding (OHE), also called dummy encoding, creates K binary indicator columns for a feature with K categories. Each sample has exactly one '1' and K-1 '0's across these columns.
Category → apple banana cherry date
'apple' → 1 0 0 0
'banana' → 0 1 0 0
'cherry' → 0 0 1 0
'date' → 0 0 0 1
Why One-Hot Encoding Works:
Each category gets its own orthogonal dimension, so no artificial ordering or distance is imposed—every pair of categories is equidistant—and linear models can learn an independent coefficient per category.
The Dimensionality Problem:
For K categories, OHE produces K dimensions. This creates three interrelated problems: memory that scales linearly with K, computation over very wide matrices, and statistical sparsity—each indicator column is non-zero for only a small fraction of samples, so rare categories get few observations. The table below shows how storage scales:
| Cardinality (K) | Samples (n) | Dense Storage | Sparse Storage (1% density) | Practical? |
|---|---|---|---|---|
| 10 | 100,000 | 1 MB | ~40 KB | ✅ Trivial |
| 100 | 100,000 | 10 MB | ~400 KB | ✅ Easy |
| 1,000 | 100,000 | 100 MB | ~4 MB | ✅ Manageable |
| 10,000 | 100,000 | 1 GB | ~40 MB | ⚠️ Challenging |
| 100,000 | 100,000 | 10 GB | ~400 MB | ❌ Impractical |
| 1,000,000 | 100,000 | 100 GB | ~4 GB | ❌ Infeasible |
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'yellow'],
    'size': ['S', 'M', 'L', 'M', 'S', 'L', 'M']
})

# Method 1: pandas get_dummies (convenient but creates dense arrays)
df_encoded = pd.get_dummies(df, columns=['color', 'size'], prefix=['color', 'size'])
print("Pandas get_dummies result:")
print(df_encoded)

# Method 2: sklearn OneHotEncoder (production-grade, handles unknowns)
encoder = OneHotEncoder(
    sparse_output=True,       # Return sparse matrix for memory efficiency
    handle_unknown='ignore',  # Silently ignore unseen categories (all zeros)
    drop='first',             # Drop first category to avoid multicollinearity
    min_frequency=2,          # Combine rare categories (frequency < 2)
    max_categories=10         # Limit max categories (combine rest as 'infrequent')
)

# Fit and transform
encoded_sparse = encoder.fit_transform(df[['color', 'size']])
print(f"\nSparse matrix shape: {encoded_sparse.shape}")
print(f"Sparse matrix density: {encoded_sparse.nnz / np.prod(encoded_sparse.shape):.2%}")
print(f"Memory usage: {encoded_sparse.data.nbytes + encoded_sparse.indices.nbytes + encoded_sparse.indptr.nbytes} bytes")

# Get feature names for interpretability
feature_names = encoder.get_feature_names_out(['color', 'size'])
print(f"\nFeature names: {feature_names}")

# Handle unseen categories at inference time
new_data = pd.DataFrame({'color': ['purple', 'red'], 'size': ['XL', 'S']})
new_encoded = encoder.transform(new_data)
print(f"\nNew data encoding (note zeros for unseen 'purple' and 'XL'):")
print(new_encoded.toarray())

# Inverse transform (only possible without drop='first')
encoder_no_drop = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder_no_drop.fit(df[['color', 'size']])
encoded = encoder_no_drop.transform(df[['color', 'size']])
decoded = encoder_no_drop.inverse_transform(encoded)
print(f"\nDecoded: {decoded}")
```

When using OHE with linear models (linear regression, logistic regression), the K indicator columns are perfectly collinear—they sum to 1. This causes the normal equations to be singular. Solution: drop one column (the 'reference category'). Tree-based models don't suffer from this issue. Use drop='first' in sklearn's OneHotEncoder for linear models.
Sparse Matrix Representation:
For high-cardinality OHE, sparse matrix formats are essential. The Compressed Sparse Row (CSR) format stores only non-zero values:
- data: array of non-zero values (all 1s for OHE)
- indices: column index for each non-zero value
- indptr: row pointer array

For a one-hot encoding of 1 million categories over 100K samples, CSR stores roughly 100K non-zero entries (a few megabytes) instead of a 100,000 × 1,000,000 dense matrix (hundreds of gigabytes as floats).
A 125,000x memory reduction.
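The sketch below builds the CSR arrays for a one-hot matrix directly from integer category codes and compares the byte footprint against a dense float32 equivalent; the 100K × 10K size is scaled down from the example above so it runs comfortably in memory.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_samples, n_categories = 100_000, 10_000

codes = rng.integers(0, n_categories, size=n_samples)  # integer code per sample

# One non-zero per row: data/indices/indptr map directly onto the CSR description above
data = np.ones(n_samples, dtype=np.float32)
indices = codes.astype(np.int32)
indptr = np.arange(n_samples + 1, dtype=np.int32)
onehot = sparse.csr_matrix((data, indices, indptr), shape=(n_samples, n_categories))

sparse_bytes = onehot.data.nbytes + onehot.indices.nbytes + onehot.indptr.nbytes
dense_bytes = n_samples * n_categories * 4  # dense float32 equivalent
print(f"CSR:   {sparse_bytes / 1e6:.1f} MB")   # ~1.2 MB
print(f"Dense: {dense_bytes / 1e9:.1f} GB")    # ~4 GB
```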
However, sparse matrices have their own limitations: not every estimator or preprocessing step accepts sparse input, operations such as mean-centering destroy sparsity (and can silently densify the matrix), and random element access is slower than on contiguous dense arrays.
Binary encoding is a dimensionality-efficient compromise between label encoding and one-hot encoding. Categories are first integer-encoded, then the integer is represented in binary, with each bit becoming a separate column.
Category Integer Binary b₃ b₂ b₁ b₀
'apple' → 0 → 0000 0 0 0 0
'banana' → 1 → 0001 0 0 0 1
'cherry' → 2 → 0010 0 0 1 0
...
'kiwi' → 10 → 1010 1 0 1 0
The Mathematical Advantage:
For K categories, binary encoding produces only ⌈log₂(K)⌉ columns:
| Cardinality (K) | One-Hot Cols | Binary Cols | Reduction Factor |
|---|---|---|---|
| 10 | 10 | 4 | 2.5x |
| 100 | 100 | 7 | 14x |
| 1,000 | 1,000 | 10 | 100x |
| 10,000 | 10,000 | 14 | 714x |
| 1,000,000 | 1,000,000 | 20 | 50,000x |
This is a dramatic improvement, especially at high cardinalities.
```python
import pandas as pd
import numpy as np
import category_encoders as ce

# Sample high-cardinality data
np.random.seed(42)
n_samples = 10000
n_categories = 1000  # High cardinality

df = pd.DataFrame({
    'category': [f'cat_{i}' for i in np.random.randint(0, n_categories, n_samples)],
    'target': np.random.randn(n_samples)
})

# Binary encoding using category_encoders library
binary_encoder = ce.BinaryEncoder(cols=['category'])
df_encoded = binary_encoder.fit_transform(df[['category']])

print(f"Original: {n_categories} unique categories")
print(f"Binary encoding columns: {df_encoded.shape[1]}")
print(f"Expected: ceil(log2({n_categories})) = {int(np.ceil(np.log2(n_categories)))}")
print(f"\nColumn names: {df_encoded.columns.tolist()}")
print(f"\nSample encodings:")
print(df_encoded.head(10))

# Memory comparison
one_hot_memory = n_samples * n_categories * 1  # bytes (assuming uint8)
binary_memory = n_samples * int(np.ceil(np.log2(n_categories))) * 1
print(f"\nMemory: One-Hot = {one_hot_memory / 1e6:.2f} MB, Binary = {binary_memory / 1e6:.2f} MB")
print(f"Reduction factor: {one_hot_memory / binary_memory:.1f}x")

# Manual implementation for understanding
def binary_encode(categories: pd.Series) -> pd.DataFrame:
    """Manual binary encoding implementation."""
    # Step 1: Integer encode
    unique_cats = categories.unique()
    cat_to_int = {cat: i for i, cat in enumerate(unique_cats)}
    integers = categories.map(cat_to_int)

    # Step 2: Determine number of bits needed
    n_bits = int(np.ceil(np.log2(len(unique_cats) + 1)))

    # Step 3: Convert to binary representation
    result = pd.DataFrame()
    for bit in range(n_bits):
        result[f'bit_{n_bits - 1 - bit}'] = ((integers >> (n_bits - 1 - bit)) & 1).astype(int)
    return result

# Test manual implementation
df_manual = binary_encode(df['category'])
print(f"\nManual implementation shape: {df_manual.shape}")
```

Binary encoding introduces an implicit structure: categories with adjacent binary representations share bit patterns. 'cat_4' (0100) and 'cat_5' (0101) differ by one bit; 'cat_4' (0100) and 'cat_8' (1000) differ by two bits. This creates artificial similarity based on arbitrary integer assignment, not semantic meaning. For truly unrelated categories, this introduces noise.
When to Use Binary Encoding:
Good candidates: tree-based models on medium-to-high-cardinality features where one-hot is too wide, memory-constrained pipelines, and features whose individual categories don't need separate interpretable coefficients.
Poor candidates: linear and distance-based models that are sensitive to the artificial bit-pattern similarity, low-cardinality features (one-hot is cheap enough), and features where a handful of specific categories carry strong, distinct signal.
BaseN Encoding Extension:
Binary encoding is a special case of BaseN encoding with N=2; in the category_encoders implementation, base-1 reproduces one-hot encoding and very large bases approach ordinal encoding. Lower bases use more columns but impose less arbitrary bit-pattern similarity; higher bases (e.g., base-5, base-10) use fewer columns at the cost of more label-encoding-like structure. The category_encoders library supports arbitrary bases, as the sketch below illustrates.
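A quick sketch of that tradeoff, assuming the category_encoders BaseNEncoder and a synthetic 1,000-category column: as the base grows, the number of output columns shrinks.

```python
import numpy as np
import pandas as pd
import category_encoders as ce

np.random.seed(0)
df = pd.DataFrame({'cat': [f'cat_{i}' for i in np.random.randint(0, 1000, 5000)]})

# Higher base -> fewer columns, but more label-encoding-like structure per column
for base in [2, 4, 8, 16]:
    width = ce.BaseNEncoder(cols=['cat'], base=base).fit_transform(df).shape[1]
    print(f"base={base:>2}: {width} columns")
```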
Frequency encoding and count encoding replace each category with a scalar derived from the training data distribution. These methods reduce any cardinality to a single column while preserving some statistical information.
Count Encoding: Replace each category with the number of times it appears in the training set.
Category Count in Training
'apple' → 1,523
'banana' → 3,891
'cherry' → 89
'date' → 456
Frequency Encoding: Replace each category with its relative frequency (proportion).
Category Frequency
'apple' → 0.152 (15.2%)
'banana' → 0.389 (38.9%)
'cherry' → 0.009 (0.9%)
'date' → 0.046 (4.6%)
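The core idiom is just value_counts plus map, as in the minimal sketch below; the production-grade class later on this page wraps the same idea with unseen-category handling. The fruit data and the zero fallback are illustrative.

```python
import pandas as pd

train = pd.Series(['apple', 'banana', 'banana', 'cherry', 'apple', 'banana'], name='fruit')
test = pd.Series(['banana', 'kiwi'], name='fruit')   # 'kiwi' never seen in training

freq_map = train.value_counts(normalize=True)        # learned on training data only

train_encoded = train.map(freq_map)
test_encoded = test.map(freq_map).fillna(0.0)        # simple fallback for unseen categories
print(train_encoded.round(3).tolist())               # [0.333, 0.5, 0.5, 0.167, 0.333, 0.5]
print(test_encoded.tolist())                         # [0.5, 0.0]
```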
Why These Methods Can Work Surprisingly Well:
Category frequency often correlates with the target variable in real-world problems: popular products tend to convert or get returned at different rates than niche ones, rarely seen merchant IDs can be disproportionately associated with fraud, and common words behave differently from rare ones in language tasks.
```python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Production-ready frequency encoder with proper train/test handling.
    """

    def __init__(self, columns=None, normalize=True, handle_unknown='global_mean'):
        """
        Parameters:
        -----------
        columns : list of str
            Columns to encode. If None, encode all object/category columns.
        normalize : bool
            If True, use frequencies (0-1). If False, use raw counts.
        handle_unknown : str
            Strategy for unseen categories: 'global_mean', 'zero', 'min', 'error'
        """
        self.columns = columns
        self.normalize = normalize
        self.handle_unknown = handle_unknown
        self.encoding_maps_ = {}
        self.global_defaults_ = {}

    def fit(self, X, y=None):
        """Learn frequency/count mappings from training data."""
        X = X.copy()
        if self.columns is None:
            self.columns = X.select_dtypes(include=['object', 'category']).columns.tolist()

        for col in self.columns:
            counts = X[col].value_counts(normalize=self.normalize)
            self.encoding_maps_[col] = counts.to_dict()

            # Set default for unseen categories
            if self.handle_unknown == 'global_mean':
                self.global_defaults_[col] = counts.mean()
            elif self.handle_unknown == 'zero':
                self.global_defaults_[col] = 0
            elif self.handle_unknown == 'min':
                self.global_defaults_[col] = counts.min()
            else:
                self.global_defaults_[col] = None

        return self

    def transform(self, X):
        """Apply learned frequency mappings."""
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].map(self.encoding_maps_[col])

            # Handle unseen categories
            if self.handle_unknown == 'error':
                if X[col].isna().any():
                    raise ValueError(f"Unseen categories found in column '{col}'")
            else:
                X[col] = X[col].fillna(self.global_defaults_[col])

        return X


# Example usage
np.random.seed(42)
train_df = pd.DataFrame({
    'product_id': np.random.choice(['P001', 'P002', 'P003', 'P004'], 1000, p=[0.5, 0.3, 0.15, 0.05]),
    'merchant': np.random.choice(['M1', 'M2', 'M3'], 1000, p=[0.6, 0.3, 0.1]),
    'target': np.random.randint(0, 2, 1000)
})

# Fit encoder on training data
encoder = FrequencyEncoder(normalize=True, handle_unknown='global_mean')
train_encoded = encoder.fit_transform(train_df)

print("Training data frequencies:")
print(train_encoded.head(10))
print(f"\nLearned mappings:")
for col, mapping in encoder.encoding_maps_.items():
    print(f"  {col}: {mapping}")

# Apply to test data (including unseen category 'P999')
test_df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P999', 'P003'],  # P999 is unseen
    'merchant': ['M1', 'M2', 'M4', 'M3'],            # M4 is unseen
    'target': [0, 1, 1, 0]
})

test_encoded = encoder.transform(test_df)
print("\nTest data with unseen categories:")
print(test_encoded)
```

Multiple categories with identical frequencies get the same encoding. If 'apple' and 'kiwi' both appear 1,523 times, they become indistinguishable. This information loss may or may not matter depending on the problem. Mitigation: combine frequency with other encodings or add small random noise.
Advanced Frequency-Based Methods:
Rank Encoding: Replace with the rank of frequency (1st most common, 2nd most common, etc.). More robust to outlier counts.
Log-Frequency Encoding: Use log(count + 1) to compress the range for power-law distributed categories.
Normalized Rank: Rank divided by the number of categories, scaled to [0, 1]. (The rank and log-frequency variants are sketched after this list.)
Weight of Evidence (WoE): For binary classification, encode based on the log-odds ratio of the target within each category. (Covered in Target Encoding page.)
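The rank and log-frequency variants listed above can be derived directly from the raw counts, as in the sketch below; the counts reuse the fruit example from earlier on this page.

```python
import numpy as np
import pandas as pd

s = pd.Series(['apple'] * 1523 + ['banana'] * 3891 + ['cherry'] * 89 + ['date'] * 456)

counts = s.value_counts()                              # raw count per category
rank = counts.rank(ascending=False, method='dense')    # 1 = most frequent
log_freq = np.log1p(counts)                            # log(count + 1) tames power-law tails
norm_rank = (rank - 1) / (len(counts) - 1)             # scaled to [0, 1]

print(pd.DataFrame({'count': counts, 'rank': rank,
                    'log_freq': log_freq, 'norm_rank': norm_rank}))

# Map any of these back onto the original column as usual
s_rank_encoded = s.map(rank)
```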
Practical Considerations:
Compute counts on the training split only and refresh them when the category distribution drifts; decide explicitly how ties, missing values, and unseen categories are handled; and prefer the log or rank variants when counts follow a heavy-tailed distribution.
Selecting the right encoding strategy requires understanding the interplay between data characteristics, model requirements, and deployment constraints. The following decision framework provides a structured approach.
| Scenario | Tree-Based Models | Linear Models | Neural Networks |
|---|---|---|---|
| Low Cardinality (K < 10) | One-Hot or Label | One-Hot (drop one) | One-Hot or Embedding |
| Medium Cardinality (10 ≤ K < 100) | One-Hot or Label | One-Hot (sparse) | Embedding |
| High Cardinality (100 ≤ K < 10K) | Label + Target Encoding | Target Encoding / Hashing | Embedding (required) |
| Very High Cardinality (K ≥ 10K) | Target + Frequency | Hash Encoding | Embedding (required) |
| Ordinal Categories | Label (respects order) | Label or Ordinal OHE | Ordinal Embedding |
In practice, top Kaggle competitors and production systems often use multiple encodings simultaneously. A single high-cardinality feature might be encoded as: (1) target encoding for the primary signal, (2) frequency encoding as additional context, and (3) one-hot for the top-10 most frequent categories. The model learns which representations are useful.
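A simplified sketch of that pattern for a single high-cardinality column, assuming the category_encoders TargetEncoder and synthetic merchant data. The target encoding here is fit on the full training frame for brevity; the leakage-safe, cross-validated version is the subject of the next page.

```python
import numpy as np
import pandas as pd
import category_encoders as ce

np.random.seed(0)
df = pd.DataFrame({
    'merchant': [f'M{i}' for i in np.random.zipf(1.5, 5000) % 2000],  # skewed, high cardinality
    'target': np.random.randint(0, 2, 5000),
})
X, y = df[['merchant']], df['target']

# (1) Target encoding for the primary signal
features = ce.TargetEncoder(cols=['merchant'], smoothing=10).fit_transform(X, y)
features = features.rename(columns={'merchant': 'merchant_te'})

# (2) Frequency encoding as additional context
freq = X['merchant'].value_counts(normalize=True)
features['merchant_freq'] = X['merchant'].map(freq)

# (3) One-hot indicators only for the top-10 most frequent merchants
for m in freq.head(10).index:
    features[f'merchant_is_{m}'] = (X['merchant'] == m).astype(int)

print(features.shape)   # 1 target-encoded + 1 frequency + 10 indicator columns
```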
Robust categorical encoding in production requires attention to details often overlooked in prototypes. The following practices distinguish production-grade implementations from experimental code.
Guard against dimensionality blow-ups from unexpectedly high-cardinality features, for example via sklearn's max_categories parameter or pre-filtering of rare categories.
```python
import joblib
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import category_encoders as ce
from datetime import datetime


class EncodingPipeline:
    """
    Production-ready encoding pipeline with proper versioning and persistence.
    """

    def __init__(self, version: str = None):
        self.version = version or datetime.now().strftime("%Y%m%d_%H%M%S")
        self.pipeline = None
        self.metadata = {
            "version": self.version,
            "created_at": datetime.now().isoformat(),
            "feature_columns": None,
            "category_mappings": {}
        }

    def build_pipeline(self,
                       low_cardinality_cols: list,
                       high_cardinality_cols: list,
                       numeric_cols: list,
                       target_encode_cols: list = None):
        """
        Construct sklearn ColumnTransformer with appropriate encoders.
        """
        transformers = []

        # Low cardinality: One-Hot Encoding
        if low_cardinality_cols:
            transformers.append((
                'low_card_ohe',
                OneHotEncoder(
                    sparse_output=True,
                    handle_unknown='ignore',
                    drop='if_binary',
                    min_frequency=0.01  # Group rare categories
                ),
                low_cardinality_cols
            ))

        # High cardinality: Target Encoding (must fit with target!)
        if high_cardinality_cols:
            transformers.append((
                'high_card_target',
                ce.TargetEncoder(
                    cols=high_cardinality_cols,
                    smoothing=100,           # Regularization
                    handle_unknown='value',  # Use global mean
                    handle_missing='value'
                ),
                high_cardinality_cols
            ))

        # Numeric: Standard scaling with imputation
        if numeric_cols:
            transformers.append((
                'numeric',
                Pipeline([
                    ('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())
                ]),
                numeric_cols
            ))

        self.pipeline = ColumnTransformer(
            transformers=transformers,
            remainder='drop',  # Explicitly drop unused columns
            verbose_feature_names_out=True
        )

        self.metadata["feature_columns"] = {
            "low_cardinality": low_cardinality_cols,
            "high_cardinality": high_cardinality_cols,
            "numeric": numeric_cols
        }
        return self

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        """Fit all encoders. Target y required for target encoding."""
        if self.pipeline is None:
            raise ValueError("Build pipeline before fitting.")

        # Note: ColumnTransformer passes y to each transformer that needs it
        self.pipeline.fit(X, y)

        # Store category mappings for monitoring
        for name, transformer, cols in self.pipeline.transformers_:
            if isinstance(transformer, OneHotEncoder):
                for col, categories in zip(cols, transformer.categories_):
                    self.metadata["category_mappings"][col] = list(categories)

        self.metadata["n_output_features"] = len(self.get_feature_names())
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """Transform data using fitted encoders."""
        return self.pipeline.transform(X)

    def fit_transform(self, X: pd.DataFrame, y: pd.Series = None) -> np.ndarray:
        """Fit and transform in one step."""
        self.fit(X, y)
        return self.transform(X)

    def get_feature_names(self) -> list:
        """Get output feature names."""
        return list(self.pipeline.get_feature_names_out())

    def save(self, path: str):
        """Persist pipeline and metadata."""
        artifact = {
            "pipeline": self.pipeline,
            "metadata": self.metadata
        }
        joblib.dump(artifact, path)
        print(f"Saved encoding pipeline v{self.version} to {path}")

    @classmethod
    def load(cls, path: str) -> 'EncodingPipeline':
        """Load persisted pipeline."""
        artifact = joblib.load(path)
        instance = cls(version=artifact["metadata"]["version"])
        instance.pipeline = artifact["pipeline"]
        instance.metadata = artifact["metadata"]
        return instance


# Usage example
np.random.seed(42)
df = pd.DataFrame({
    'category_low': np.random.choice(['A', 'B', 'C'], 1000),
    'category_high': [f'id_{i}' for i in np.random.randint(0, 500, 1000)],
    'numeric_1': np.random.randn(1000),
    'numeric_2': np.random.randn(1000) * 10,
    'target': np.random.randint(0, 2, 1000)
})

# Build and fit pipeline
enc_pipeline = EncodingPipeline(version="v1.0.0")
enc_pipeline.build_pipeline(
    low_cardinality_cols=['category_low'],
    high_cardinality_cols=['category_high'],
    numeric_cols=['numeric_1', 'numeric_2']
)

X = df.drop('target', axis=1)
y = df['target']
X_encoded = enc_pipeline.fit_transform(X, y)

print(f"Pipeline version: {enc_pipeline.version}")
print(f"Output shape: {X_encoded.shape}")
print(f"Feature names: {enc_pipeline.get_feature_names()[:10]}...")

# Save for production
enc_pipeline.save("encoding_pipeline.joblib")

# Load in production
loaded_pipeline = EncodingPipeline.load("encoding_pipeline.joblib")
print(f"Loaded version: {loaded_pipeline.version}")
```

This page has established the foundational concepts of categorical encoding, covering the spectrum from simple label encoding to production-ready pipelines. We've examined the tradeoffs between dimensionality, information preservation, and computational efficiency.
The next page dives into Target Encoding, a powerful technique that encodes categories using the target variable itself. We'll cover the mathematical foundations, regularization strategies to prevent overfitting, proper cross-validation schemes, and the relationship to Weight of Evidence encoding. Target encoding is one of the most impactful tools for high-cardinality features in supervised learning.