Machine learning algorithms speak the language of numbers. They compute distances, gradients, and dot products—operations that require numerical inputs. Yet much of the world's most valuable data comes in categorical form: countries, product types, job titles, user segments, disease classifications.
The encoding challenge: Transforming categories into numbers isn't neutral. The choice of encoding fundamentally changes what a model can learn. One-hot encoding with 10,000 product categories creates 10,000 sparse features; target encoding collapses them to one dense column. Tree-based models thrive on ordinal encoding; linear models may catastrophically misinterpret it. AutoML systems must navigate this landscape automatically, selecting encodings that match the downstream model and dataset characteristics.
By the end of this page, you will understand: the complete taxonomy of categorical encodings, when each encoding type excels or fails, how model type determines optimal encoding, target-based encodings and their leakage risks, and how AutoML systems automatically select encodings for diverse feature sets.
Categorical encodings fall into three fundamental families, each with distinct properties, assumptions, and use cases. Understanding this taxonomy is essential for both manual feature engineering and understanding AutoML encoding decisions.
The Three Encoding Families:
| Encoding | Output Dimensions | Ordering Assumed | Uses Target | Best For |
|---|---|---|---|---|
| Label Encoding | 1 | Yes (implicit) | No | Tree models, ordinal data |
| One-Hot Encoding | k (categories) | No | No | Linear models, low cardinality |
| Target Encoding | 1 | No | Yes | High cardinality, any model |
| Binary Encoding | log₂(k) | Partial | No | Medium cardinality, trees |
| Hash Encoding | Fixed n | No | No | Very high cardinality, online learning |
| Embedding | d (learned) | No | Yes | Very high cardinality, neural nets |
Cardinality (the number of unique categories) is the primary driver of encoding selection. A 'country' feature with 200 values demands different handling than a 'user_id' feature with 10 million. AutoML systems must estimate how a feature's cardinality will affect downstream training time, memory use, and generalization.
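As a concrete, purely illustrative sketch of how a system might profile cardinality before choosing an encoding, the snippet below buckets categorical columns with pandas. The function name and thresholds are invented for the example (they mirror the selector thresholds later on this page, but are not prescribed by the text):

```python
import pandas as pd

# Illustrative thresholds -- real systems tune these per model family
LOW, MEDIUM, HIGH = 10, 100, 1000

def cardinality_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each categorical column's cardinality before choosing an encoding."""
    rows = []
    for col in df.select_dtypes(include=['object', 'category']).columns:
        k = df[col].nunique(dropna=True)
        if k <= LOW:
            bucket = 'low'        # one-hot is usually safe
        elif k <= MEDIUM:
            bucket = 'medium'     # one-hot with infrequent grouping, or binary
        elif k <= HIGH:
            bucket = 'high'       # target or hash encoding
        else:
            bucket = 'very high'  # hashing or learned embeddings
        rows.append({'feature': col, 'cardinality': k, 'bucket': bucket})
    return pd.DataFrame(rows)
```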
The simplest encoding assigns each category an integer. This approach is trivially efficient but carries implicit assumptions that can dramatically affect model behavior.
Label Encoding: Categories are mapped to integers 0, 1, 2, ..., k-1 in order of appearance or alphabetically. No inherent meaning to the numbers—'apple' = 0, 'banana' = 1, 'cherry' = 2 is arbitrary.
Ordinal Encoding: Categories are mapped to integers according to a meaningful order. 'low' = 0, 'medium' = 1, 'high' = 2 preserves the semantic ordering.
The Critical Distinction: Label encoding is arbitrary ordinal encoding. The danger: models may interpret the arbitrary numbers as meaningful. A linear model sees 'cherry' (2) as twice 'banana' (1), which is nonsense for unordered categories.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from typing import Dict, List, Optional


class AutoMLOrdinalEncoder:
    """
    Intelligent ordinal encoding with automatic order detection.

    Features:
    1. Detects natural orderings (size, frequency, temporal)
    2. Uses frequency-based ordering when no semantic order exists
    3. Handles unknown categories at inference time
    """

    # Known ordinal patterns
    ORDINAL_PATTERNS = {
        'size': ['xs', 'extra small', 's', 'small', 'm', 'medium',
                 'l', 'large', 'xl', 'extra large', 'xxl'],
        'frequency': ['never', 'rarely', 'sometimes', 'often', 'always'],
        'agreement': ['strongly disagree', 'disagree', 'neutral',
                      'agree', 'strongly agree'],
        'quality': ['poor', 'fair', 'good', 'very good', 'excellent'],
        'priority': ['low', 'medium', 'high', 'critical'],
    }

    def __init__(
        self,
        unknown_value: int = -1,
        use_frequency_order: bool = True
    ):
        self.unknown_value = unknown_value
        self.use_frequency_order = use_frequency_order
        self.mappings_: Dict[str, Dict] = {}
        self.is_ordinal_: Dict[str, bool] = {}

    def _detect_ordinal_pattern(self, values: pd.Series) -> Optional[List[str]]:
        """
        Detect if values match a known ordinal pattern.
        Returns the ordered list if a pattern is detected, None otherwise.
        """
        clean_values = set(v.lower().strip() for v in values.dropna().unique())

        for pattern_name, pattern_order in self.ORDINAL_PATTERNS.items():
            pattern_set = set(pattern_order)
            # Check if values are a subset of the pattern
            if clean_values.issubset(pattern_set):
                # Return only the values that exist, in order
                return [v for v in pattern_order if v in clean_values]

        return None

    def fit(self, X: pd.DataFrame) -> 'AutoMLOrdinalEncoder':
        """
        Fit encoder, detecting ordinal patterns and building mappings.
        """
        for col in X.columns:
            values = X[col].astype(str)

            # Try to detect an ordinal pattern
            ordinal_order = self._detect_ordinal_pattern(values)

            if ordinal_order is not None:
                # Use detected semantic ordering. Key the mapping by the original
                # (un-normalized) values so transform() can look them up directly.
                self.is_ordinal_[col] = True
                order_index = {v: i for i, v in enumerate(ordinal_order)}
                self.mappings_[col] = {
                    v: order_index.get(v.lower().strip(), self.unknown_value)
                    for v in values.unique()
                }
            elif self.use_frequency_order:
                # Order by frequency (most common = lowest value)
                self.is_ordinal_[col] = False
                value_counts = values.value_counts()
                self.mappings_[col] = {
                    v: i for i, v in enumerate(value_counts.index)
                }
            else:
                # Arbitrary order (alphabetical)
                self.is_ordinal_[col] = False
                unique_vals = sorted(values.unique())
                self.mappings_[col] = {
                    v: i for i, v in enumerate(unique_vals)
                }

        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transform categories to ordinal integers.
        """
        X_encoded = X.copy()

        for col in X.columns:
            if col in self.mappings_:
                mapping = self.mappings_[col]
                X_encoded[col] = X[col].astype(str).map(
                    lambda v: mapping.get(v, self.unknown_value)
                )

        return X_encoded

    def get_encoding_report(self) -> pd.DataFrame:
        """
        Report which features were detected as truly ordinal.
        """
        return pd.DataFrame({
            'feature': list(self.is_ordinal_.keys()),
            'is_semantic_ordinal': list(self.is_ordinal_.values()),
            'n_categories': [len(m) for m in self.mappings_.values()]
        })
```

Using label encoding with linear models (linear regression, logistic regression, SVM) on nominal (unordered) categories is a common mistake. The model will learn a single coefficient for the encoded integer, implying a linear relationship between arbitrary category assignments. Always use one-hot encoding for nominal features with linear models.
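As a quick illustration of the encoder above, here is a hypothetical usage sketch; the toy DataFrame and column names are invented for the example:

```python
# Hypothetical toy data: one column matches the 'size' pattern, one does not
df = pd.DataFrame({
    'shirt_size': ['small', 'large', 'medium', 'small', 'xl'],
    'city': ['paris', 'london', 'paris', 'tokyo', 'paris'],
})

encoder = AutoMLOrdinalEncoder().fit(df)
print(encoder.transform(df))
# 'shirt_size' is detected as semantic ordinal: small=0, medium=1, large=2, xl=3
# 'city' has no known ordering, so it falls back to frequency order (paris=0, ...)
print(encoder.get_encoding_report())
```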
One-hot encoding eliminates the ordering problem by creating k binary columns for k categories. Each observation has exactly one '1' and k-1 '0's. This representation makes no assumptions about category relationships—each category gets its own learned parameter.
One-Hot vs. Dummy Encoding: One-hot keeps all k binary columns. Dummy (drop-first) encoding keeps k-1 columns, with the dropped category serving as the reference level; this removes the perfect multicollinearity that unregularized linear models cannot handle, while regularized and tree-based models work with either form. The snippet below shows both.
When One-Hot Works: Low-to-moderate cardinality (roughly tens of categories), linear or distance-based models, and categories that each appear often enough to support a reliable per-category parameter.
When One-Hot Fails: High cardinality explodes the feature space (10,000 categories become 10,000 mostly-zero columns), inflating memory and training time, leaving rare categories with too few examples to learn from, and forcing tree models to spend many splits isolating individual categories.
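A short sketch of the one-hot versus dummy distinction using pandas; the column name and values are illustrative:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Full one-hot: k = 3 binary columns, exactly one '1' per row
onehot = pd.get_dummies(colors, columns=['color'])

# Dummy encoding: k - 1 columns; the dropped category becomes the reference level
dummy = pd.get_dummies(colors, columns=['color'], drop_first=True)

print(onehot.shape, dummy.shape)  # (4, 3) (4, 2)
```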
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse


class AutoMLOneHotEncoder:
    """
    One-hot encoding with AutoML-style cardinality management.

    Features:
    1. Automatic cardinality threshold for switching strategies
    2. Infrequent category grouping to limit dimensionality
    3. Sparse output for memory efficiency
    """

    def __init__(
        self,
        max_categories: int = 100,
        min_frequency: float = 0.01,
        handle_unknown: str = 'infrequent_if_exist'
    ):
        self.max_categories = max_categories
        self.min_frequency = min_frequency
        self.handle_unknown = handle_unknown
        self.encoders_ = {}
        self.feature_names_ = []

    def fit(self, X: pd.DataFrame) -> 'AutoMLOneHotEncoder':
        """
        Fit one-hot encoder with intelligent cardinality handling.
        """
        for col in X.columns:
            cardinality = X[col].nunique()

            encoder = OneHotEncoder(
                sparse_output=True,
                handle_unknown=self.handle_unknown,
                min_frequency=self.min_frequency if cardinality > self.max_categories else None,
                max_categories=self.max_categories if cardinality > self.max_categories else None,
                drop='if_binary'  # Drop one column for binary features
            )
            encoder.fit(X[[col]])
            self.encoders_[col] = encoder

            # Store feature names
            feature_names = encoder.get_feature_names_out([col])
            self.feature_names_.extend(feature_names)

        return self

    def transform(self, X: pd.DataFrame) -> sparse.csr_matrix:
        """
        Transform to sparse one-hot matrix.
        """
        encoded_parts = []

        for col in X.columns:
            if col in self.encoders_:
                encoded = self.encoders_[col].transform(X[[col]])
                encoded_parts.append(encoded)

        return sparse.hstack(encoded_parts, format='csr')

    def get_feature_names(self) -> list:
        return self.feature_names_
```

One-hot encoded data is inherently sparse: each row has mostly zeros. Using sparse matrix representations (scipy.sparse) can reduce memory by 10-100x compared to dense arrays. AutoML systems default to sparse outputs, converting to dense only when required by specific algorithms.
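To make the sparse-memory point concrete, here is a rough usage sketch of the encoder above on invented data. It assumes a recent scikit-learn (1.3+) so that sparse output, infrequent-category grouping, and the drop option can be combined; the numbers are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical feature: 100,000 rows, roughly 500 distinct merchants
rng = np.random.default_rng(0)
df = pd.DataFrame({'merchant': rng.integers(0, 500, size=100_000).astype(str)})

encoder = AutoMLOneHotEncoder(max_categories=100, min_frequency=0.0001)
X_sparse = encoder.fit(df).transform(df)

# Compare the sparse footprint with an equivalent dense float64 matrix
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
print(X_sparse.shape,
      f"dense ~{dense_bytes / 1e6:.0f} MB vs sparse ~{sparse_bytes / 1e6:.1f} MB")
```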
Target encoding replaces each category with a statistic computed from the target variable—typically the mean for regression or class probability for classification. This seemingly simple idea is remarkably powerful: it compresses high-cardinality features into a single column while encoding predictive information.
The Promise: A single dense column regardless of cardinality (a 10,000-category feature becomes one number per row) that carries direct predictive signal and can feed any downstream model.
The Peril: Because the encoding is computed from the target, it can leak target information back into the features. Rare categories receive noisy, easily overfit estimates, and naive in-sample encoding lets each row see its own label. Smoothing and out-of-fold encoding, as in the worked example and implementation below, are mandatory safeguards.
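To see what the smoothing buys, here is a tiny worked example of the formula used in the implementation that follows; the numbers are made up:

```python
def smoothed_mean(cat_mean, cat_count, global_mean, smoothing=10.0):
    # Weighted blend: rare categories are pulled toward the global mean
    return (cat_mean * cat_count + global_mean * smoothing) / (cat_count + smoothing)

# Rare category: seen 3 times, all positive, while the global positive rate is 0.30
print(round(smoothed_mean(1.0, 3, 0.30), 3))      # 0.462 -- heavily shrunk toward 0.30
# Frequent category: 1,000 rows with mean 0.50
print(round(smoothed_mean(0.50, 1000, 0.30), 3))  # 0.498 -- essentially the raw mean
```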
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from typing import Dict


class AutoMLTargetEncoder:
    """
    Production-grade target encoding with regularization.

    Implements smoothed target encoding with a K-fold strategy
    to prevent data leakage during training.
    """

    def __init__(
        self,
        n_folds: int = 5,
        smoothing: float = 10.0,
        min_samples_leaf: int = 1,
        random_state: int = 42
    ):
        self.n_folds = n_folds
        self.smoothing = smoothing
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state
        self.global_mean_: float = None
        self.category_stats_: Dict[str, Dict] = {}

    def _smoothed_mean(
        self,
        category_mean: float,
        category_count: int,
        global_mean: float
    ) -> float:
        """
        Compute smoothed mean, blending category and global statistics.

        Formula: (category_mean * count + global_mean * smoothing) / (count + smoothing)

        When count is low, the result is close to global_mean (regularization).
        When count is high, the result approaches category_mean.
        """
        return (
            (category_mean * category_count + global_mean * self.smoothing)
            / (category_count + self.smoothing)
        )

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'AutoMLTargetEncoder':
        """
        Compute target statistics for each category.

        These are used only for transform on NEW data.
        Training data is encoded using the K-fold strategy.
        """
        self.global_mean_ = y.mean()

        for col in X.columns:
            category_stats = {}

            for category in X[col].unique():
                mask = X[col] == category

                if mask.sum() >= self.min_samples_leaf:
                    cat_mean = y[mask].mean()
                    cat_count = mask.sum()

                    category_stats[category] = {
                        'mean': cat_mean,
                        'count': cat_count,
                        'smoothed': self._smoothed_mean(
                            cat_mean, cat_count, self.global_mean_
                        )
                    }

            self.category_stats_[col] = category_stats

        return self

    def fit_transform(self, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """
        Fit and transform training data using the K-fold strategy.

        Each fold is encoded using statistics from the OTHER folds only,
        preventing target leakage.
        """
        self.fit(X, y)

        X_encoded = X.copy()
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=self.random_state)

        for col in X.columns:
            encoded_col = np.zeros(len(X))

            for train_idx, val_idx in kf.split(X):
                # Compute stats from the training fold only
                train_stats = self._compute_fold_stats(
                    X.iloc[train_idx][col],
                    y.iloc[train_idx]
                )

                # Apply to the validation fold
                encoded_col[val_idx] = X.iloc[val_idx][col].map(
                    lambda v: train_stats.get(v, self.global_mean_)
                )

            X_encoded[col] = encoded_col

        return X_encoded

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transform new data using statistics from fit.
        """
        X_encoded = X.copy()

        for col in X.columns:
            stats = self.category_stats_.get(col, {})
            X_encoded[col] = X[col].map(
                lambda v: stats.get(v, {}).get('smoothed', self.global_mean_)
            )

        return X_encoded

    def _compute_fold_stats(
        self,
        categories: pd.Series,
        targets: pd.Series
    ) -> Dict[str, float]:
        """Compute smoothed means for a single fold."""
        global_mean = targets.mean()
        stats = {}

        for cat in categories.unique():
            mask = categories == cat
            if mask.sum() >= self.min_samples_leaf:
                stats[cat] = self._smoothed_mean(
                    targets[mask].mean(), mask.sum(), global_mean
                )

        return stats
```

Naive target encoding (compute category means on all data, then apply) causes severe data leakage. Each row's target contributes to its own encoded value, creating circular logic. The model appears to perform brilliantly in training but fails catastrophically on new data. Always use K-fold or leave-one-out strategies for training data.
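A hypothetical usage sketch of the encoder above, showing the out-of-fold path for training data and the fitted statistics for new data; the toy frame and values are invented:

```python
# Hypothetical toy data; in practice this runs on far larger frames
train = pd.DataFrame({'city': ['paris', 'paris', 'tokyo', 'tokyo', 'london', 'paris']})
y_train = pd.Series([1, 0, 1, 1, 0, 1])
test = pd.DataFrame({'city': ['tokyo', 'berlin']})   # 'berlin' was never seen in training

te = AutoMLTargetEncoder(n_folds=3, smoothing=2.0)
train_enc = te.fit_transform(train, y_train)   # each training row is encoded without its own target
test_enc = te.transform(test)                  # unseen categories fall back to the global mean

print(train_enc['city'].round(3).tolist())
print(test_enc['city'].round(3).tolist())
```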
Different model families have fundamentally different relationships with categorical encodings. AutoML systems must consider the downstream model when selecting encodings—a system preparing data for XGBoost should make different choices than one targeting logistic regression.
The Model-Encoding Compatibility Matrix:
| Model Type | Recommended Encoding | Why | Avoid |
|---|---|---|---|
| Linear (LR, SVM) | One-hot or target | Needs numeric features; can't learn category splits | Label encoding (implies ordering) |
| Tree-based (RF, XGB) | Label/ordinal | Can learn arbitrary splits; one-hot adds depth | One-hot (inefficient splits) |
| Neural networks | Embedding layers | Learn representations; handle any cardinality | One-hot for high cardinality (too sparse) |
| KNN | One-hot (scaled) | Distance needs numeric; ordinal distorts distance | Label encoding (arbitrary distances) |
| Naive Bayes | Label encoding | Each category gets probability; ordinal ok | One-hot (creates dependency) |
```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Dict, Any
import pandas as pd


class ModelFamily(Enum):
    LINEAR = "linear"
    TREE = "tree"
    NEURAL = "neural"
    DISTANCE = "distance"
    BAYESIAN = "bayesian"


class EncodingType(Enum):
    LABEL = "label"
    ONEHOT = "onehot"
    TARGET = "target"
    EMBEDDING = "embedding"
    HASH = "hash"


@dataclass
class EncodingRecommendation:
    encoding: EncodingType
    reason: str
    parameters: Dict[str, Any]


class ModelAwareEncodingSelector:
    """
    Select encoding strategy based on model type and feature properties.

    Implements the model-encoding compatibility matrix with
    cardinality-based fallbacks.
    """

    # Cardinality thresholds
    LOW_CARDINALITY = 10
    MEDIUM_CARDINALITY = 100
    HIGH_CARDINALITY = 1000

    def __init__(self, model_family: ModelFamily):
        self.model_family = model_family

    def recommend(
        self,
        feature_name: str,
        cardinality: int,
        is_ordinal: bool = False,
        n_samples: int = None
    ) -> EncodingRecommendation:
        """
        Recommend encoding for a single feature.
        """
        if is_ordinal:
            # Ordinal features: always use ordinal encoding
            return EncodingRecommendation(
                encoding=EncodingType.LABEL,
                reason="Feature has natural ordering",
                parameters={'ordered': True}
            )

        # Model-specific logic
        if self.model_family == ModelFamily.LINEAR:
            return self._recommend_for_linear(cardinality, n_samples)
        elif self.model_family == ModelFamily.TREE:
            return self._recommend_for_tree(cardinality)
        elif self.model_family == ModelFamily.NEURAL:
            return self._recommend_for_neural(cardinality)
        elif self.model_family == ModelFamily.DISTANCE:
            return self._recommend_for_distance(cardinality)
        else:
            return self._recommend_default(cardinality)

    def _recommend_for_linear(
        self,
        cardinality: int,
        n_samples: int
    ) -> EncodingRecommendation:
        """Linear models: one-hot for low cardinality, target for high."""
        if cardinality <= self.LOW_CARDINALITY:
            return EncodingRecommendation(
                encoding=EncodingType.ONEHOT,
                reason="Low cardinality, linear model handles well",
                parameters={'drop': 'first'}
            )
        elif cardinality <= self.MEDIUM_CARDINALITY:
            return EncodingRecommendation(
                encoding=EncodingType.ONEHOT,
                reason="Medium cardinality with infrequent grouping",
                parameters={'max_categories': 50, 'min_frequency': 0.01}
            )
        else:
            return EncodingRecommendation(
                encoding=EncodingType.TARGET,
                reason="High cardinality; target encoding avoids dimension explosion",
                parameters={'smoothing': 10.0, 'n_folds': 5}
            )

    def _recommend_for_tree(
        self,
        cardinality: int
    ) -> EncodingRecommendation:
        """Tree models: label encoding is efficient and effective."""
        if cardinality <= self.HIGH_CARDINALITY:
            return EncodingRecommendation(
                encoding=EncodingType.LABEL,
                reason="Trees learn arbitrary splits; label encoding most efficient",
                parameters={'order': 'frequency'}
            )
        else:
            return EncodingRecommendation(
                encoding=EncodingType.TARGET,
                reason="Very high cardinality; target encoding aids tree splits",
                parameters={'smoothing': 20.0}
            )

    def _recommend_for_neural(
        self,
        cardinality: int
    ) -> EncodingRecommendation:
        """Neural networks: embeddings for large cardinality."""
        if cardinality <= self.LOW_CARDINALITY:
            return EncodingRecommendation(
                encoding=EncodingType.ONEHOT,
                reason="Low cardinality; one-hot is simpler than embedding",
                parameters={}
            )
        else:
            # Embedding dimension heuristic: fourth root of cardinality, capped
            embed_dim = min(50, max(4, int(cardinality ** 0.25)))
            return EncodingRecommendation(
                encoding=EncodingType.EMBEDDING,
                reason="Neural net learns embedding; dimensions << cardinality",
                parameters={'embedding_dim': embed_dim}
            )

    def _recommend_for_distance(
        self,
        cardinality: int
    ) -> EncodingRecommendation:
        """Distance-based models: one-hot for meaningful distances."""
        if cardinality <= self.MEDIUM_CARDINALITY:
            return EncodingRecommendation(
                encoding=EncodingType.ONEHOT,
                reason="One-hot gives meaningful Hamming distance",
                parameters={}
            )
        else:
            return EncodingRecommendation(
                encoding=EncodingType.HASH,
                reason="Hash encoding limits dimensions while preserving distance",
                parameters={'n_components': 32}
            )

    def _recommend_default(
        self,
        cardinality: int
    ) -> EncodingRecommendation:
        """Fallback for model families without specific rules (e.g. Bayesian)."""
        return EncodingRecommendation(
            encoding=EncodingType.LABEL,
            reason="Default: label encoding is cheap and broadly compatible",
            parameters={}
        )
```

When AutoML builds ensembles of diverse model types, it often creates multiple encoded versions of categorical features: one-hot for linear components, label encoding for trees. This adds memory overhead but allows each model to receive optimal input representations.
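A brief usage sketch of the selector above; the feature names and cardinalities are invented for illustration:

```python
# Compare recommendations for the same features under two downstream model families
features = {'browser': 6, 'country': 200, 'user_id': 2_000_000}

for family in (ModelFamily.LINEAR, ModelFamily.TREE):
    selector = ModelAwareEncodingSelector(model_family=family)
    for name, cardinality in features.items():
        rec = selector.recommend(name, cardinality, n_samples=1_000_000)
        print(f"{family.value:>6} | {name:<8} -> {rec.encoding.value:<7} ({rec.reason})")
```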
Encoding selection is a critical preprocessing decision that AutoML systems must make automatically. The right choice depends on cardinality, feature semantics, and the downstream model type.
You now understand how AutoML systems approach encoding selection—from understanding the encoding taxonomy to model-aware strategy selection. Next, we'll explore automated scaling selection, where AutoML must choose appropriate feature transformations for different data distributions and model requirements.