Scikit-learn provides an impressive arsenal of built-in transformers: scalers, encoders, imputers, feature selectors, and more. Yet real-world machine learning inevitably encounters transformations that no library anticipated.
Consider these scenarios (all developed later on this page):

- Clipping outliers using IQR bounds learned from the training data
- Extracting hour, day-of-week, and month features from datetime columns
- Applying a Box-Cox transformation whose optimal lambda is learned during fitting

None of these fit neatly into StandardScaler or OneHotEncoder. You need the ability to create custom transformers that encapsulate your domain logic while integrating seamlessly with Pipelines and ColumnTransformers.
By the end of this page, you will understand the scikit-learn transformer contract at a deep level and be able to implement robust custom transformers for any use case. You'll learn the difference between stateful and stateless transformers, how to ensure compatibility with cross-validation and grid search, and production-grade implementation patterns.
Before implementing custom transformers, we must deeply understand the contract they must satisfy. A scikit-learn transformer is any object that implements specific methods with specific signatures and behaviors.
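The contract is duck-typed: any object with the right methods works. As a minimal sketch (the `IdentityTransformer` name is illustrative), a bare class implementing `fit` and `transform` already runs inside a Pipeline, though without `BaseEstimator` it won't support `clone()` or grid search:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# A bare class satisfying the contract: fit() returns self,
# transform() returns transformed data. No inheritance required for
# basic Pipeline use (BaseEstimator/TransformerMixin add get_params
# and fit_transform, needed for clone/GridSearchCV).
class IdentityTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)

pipe = Pipeline([('identity', IdentityTransformer()),
                 ('model', LinearRegression())])
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
pipe.fit(X, y)
print(pipe.predict([[4.0]]))  # ≈ [8.]
```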
Required Methods:
| Method | Signature | Behavior |
|---|---|---|
| `fit` | `fit(X, y=None)` | Learn parameters from data. Returns `self`. |
| `transform` | `transform(X)` | Apply the transformation. Returns the transformed data. |
Optional But Expected:
| Method | Signature | Behavior |
|---|---|---|
| `fit_transform` | `fit_transform(X, y=None)` | Convenience: fit then transform. May be optimized. |
| `get_params` | `get_params(deep=True)` | Return a dict of constructor parameters. |
| `set_params` | `set_params(**params)` | Set parameters. Returns `self`. |
| `get_feature_names_out` | `get_feature_names_out(input_features=None)` | Return output feature names. |
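The `get_params`/`set_params` pair is exactly what GridSearchCV uses to try hyperparameter combinations. A quick check with a built-in transformer shows the round-trip:

```python
from sklearn.preprocessing import StandardScaler

# get_params returns the constructor arguments as a dict;
# set_params mutates them and returns self (enabling chaining).
scaler = StandardScaler()
print(scaler.get_params())
# {'copy': True, 'with_mean': True, 'with_std': True}

scaler.set_params(with_mean=False)
print(scaler.get_params()['with_mean'])  # False
```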
```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class LogTransformer(BaseEstimator, TransformerMixin):
    """
    Apply log transformation to numerical features.

    Demonstrates the minimal transformer contract:
    - Inherits from BaseEstimator (get_params, set_params)
    - Inherits from TransformerMixin (fit_transform default implementation)
    - Implements fit() and transform()

    Parameters
    ----------
    offset : float, default=1.0
        Value added before log to handle zeros: log(x + offset)
    base : {'natural', '10', '2'}, default='natural'
        Logarithm base to use

    Attributes
    ----------
    n_features_in_ : int
        Number of features seen during fit
    feature_names_in_ : ndarray of shape (n_features_in_,)
        Names of features seen during fit (if available)
    """

    def __init__(self, offset=1.0, base='natural'):
        # Rule 1: Store constructor parameters as attributes with SAME NAME
        self.offset = offset
        self.base = base
        # Rule 2: NO computation in __init__. Only parameter storage.

    def fit(self, X, y=None):
        """
        Fit: learn any necessary parameters from training data.

        For log transform, we don't learn anything data-dependent,
        but we still validate and record metadata.
        """
        # Validate input using sklearn's validation utilities
        X = self._validate_data(X, reset=True)
        # _validate_data automatically sets:
        # - self.n_features_in_ (number of input features)
        # - self.feature_names_in_ (if X has feature names, e.g., DataFrame)

        # Validate parameter values
        if self.offset <= 0:
            raise ValueError(f"offset must be positive, got {self.offset}")
        if self.base not in {'natural', '10', '2'}:
            raise ValueError(
                f"base must be 'natural', '10', or '2', got {self.base}"
            )

        # Rule 3: MUST return self for method chaining
        return self

    def transform(self, X):
        """Transform: apply the log transformation."""
        # Validate against fitted state
        X = self._validate_data(X, reset=False)  # reset=False checks against fit

        # Rule 4: Don't modify input; work on a copy
        X_transformed = X + self.offset

        if self.base == 'natural':
            X_transformed = np.log(X_transformed)
        elif self.base == '10':
            X_transformed = np.log10(X_transformed)
        elif self.base == '2':
            X_transformed = np.log2(X_transformed)

        return X_transformed

    def get_feature_names_out(self, input_features=None):
        """Return feature names for output features."""
        # Use sklearn's utility for consistent behavior
        from sklearn.utils.validation import _check_feature_names_in
        input_features = _check_feature_names_in(self, input_features)
        return np.asarray([f"log_{feat}" for feat in input_features])


# Usage
transformer = LogTransformer(offset=1.0, base='natural')
X = np.array([[1, 10], [2, 20], [3, 30]])
X_log = transformer.fit_transform(X)
print(X_log)
```

Forgetting `return self` in `fit()` is the single most common custom transformer bug. Without it, `fit()` returns `None`, so chained calls such as `fit(X).transform(X)` (which is exactly what Pipelines do internally) fail with cryptic AttributeErrors. Always end `fit()` with `return self`.
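To see the failure mode concretely, here is a small sketch (the `ForgotReturnSelf` name is illustrative) of a transformer whose `fit()` forgets to return `self`, breaking `Pipeline.fit`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class ForgotReturnSelf(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        pass  # BUG: implicitly returns None instead of self

    def transform(self, X):
        return np.asarray(X)

pipe = Pipeline([('broken', ForgotReturnSelf()),
                 ('model', LinearRegression())])
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# TransformerMixin.fit_transform does self.fit(X, y).transform(X);
# with fit() returning None, this chains off None and raises
# (an AttributeError in current sklearn versions).
try:
    pipe.fit(X, y)
    error_name = None
except Exception as e:
    error_name = type(e).__name__

print(error_name)
```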
Transformers fall into two fundamental categories based on whether they learn from data:
Stateless Transformers: Apply the same transformation regardless of training data. The transformation is defined entirely by constructor parameters. Examples: LogTransformer, PowerTransformer with a fixed exponent, FunctionTransformer.

Stateful Transformers: Learn transformation parameters from training data. The transformation depends on what was seen during fit(). Examples: StandardScaler (learns mean, std), OneHotEncoder (learns categories), PCA (learns components).

This distinction has profound implications for cross-validation and data leakage.
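The statefulness is easy to demonstrate with a built-in transformer: the same test point transforms differently depending on what the scaler saw during `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same input transforms differently depending on the training
# data — the signature of a STATEFUL transformer.
X_test = np.array([[10.0]])

scaler_a = StandardScaler().fit(np.array([[0.0], [20.0]]))   # mean=10, std=10
scaler_b = StandardScaler().fit(np.array([[0.0], [100.0]]))  # mean=50, std=50

print(scaler_a.transform(X_test))  # [[0.]]
print(scaler_b.transform(X_test))  # [[-0.8]]
```

This is precisely why stateful parameters must be learned on training folds only: if test data influences them, you have leakage.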
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
import numpy as np


# Stateless Option 1: FunctionTransformer for simple cases
def winsorize(X, lower=0.05, upper=0.95):
    """Clip values to percentile range."""
    lower_bound = np.percentile(X, lower * 100, axis=0)
    upper_bound = np.percentile(X, upper * 100, axis=0)
    return np.clip(X, lower_bound, upper_bound)


# WARNING: This looks stateless but has a subtle bug!
# Percentiles are computed on transform-time data, not fit-time data.
# This causes train-test leakage: test data influences its own transformation.
winsorizer = FunctionTransformer(winsorize)


# Correct stateful implementation:
class WinsorizeTransformer(BaseEstimator, TransformerMixin):
    """
    Winsorize features by clipping to percentile bounds learned during fit.

    This is STATEFUL: percentile bounds are learned from training data
    and applied identically to all future data.
    """

    def __init__(self, lower_percentile=5, upper_percentile=95):
        self.lower_percentile = lower_percentile
        self.upper_percentile = upper_percentile

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # LEARN bounds from training data (suffix _ indicates learned)
        self.lower_bounds_ = np.percentile(X, self.lower_percentile, axis=0)
        self.upper_bounds_ = np.percentile(X, self.upper_percentile, axis=0)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        # APPLY learned bounds (works correctly even if test percentiles differ)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


# True stateless transformer (transformation is fully parameterized)
class PolynomialExpansion(BaseEstimator, TransformerMixin):
    """
    Expand features with polynomial terms.

    This is STATELESS: the transformation is defined purely by
    parameters, not by any learning from data.
    """

    def __init__(self, degree=2, include_bias=False):
        self.degree = degree
        self.include_bias = include_bias

    def fit(self, X, y=None):
        # Stateless: fit only validates and records metadata
        X = self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        # Pure function of input and parameters, no learned state
        from sklearn.preprocessing import PolynomialFeatures
        poly = PolynomialFeatures(
            degree=self.degree,
            include_bias=self.include_bias
        )
        return poly.fit_transform(X)
```

Ask yourself: "If I transform new data, should the result depend on what I saw during training?" If yes, your transformer is stateful and MUST learn parameters during `fit()`. If transforming `[1, 2, 3]` gives different results depending on whether you trained on `[1, 100]` vs `[1, 1000]`, you have a stateful transformer.
For truly stateless transformations, scikit-learn provides FunctionTransformer as a convenient wrapper. It converts any function into a Pipeline-compatible transformer without writing a class:
```python
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

# Basic usage: wrap any function
log_transformer = FunctionTransformer(np.log1p)

# With inverse for interpretability
sqrt_transformer = FunctionTransformer(
    func=np.sqrt,
    inverse_func=np.square,  # Enables inverse_transform
    validate=True,           # Validate input
    check_inverse=True       # Verify inverse_func(func(X)) ≈ X
)

# With feature names
def extract_hour(X):
    """Extract hour from datetime column."""
    return X.apply(lambda col: pd.to_datetime(col).dt.hour
                   if col.dtype == 'object' else col)

hour_extractor = FunctionTransformer(
    extract_hour,
    feature_names_out=lambda self, names: [f"{n}_hour" for n in names]
)

# Passing additional parameters via kw_args
def clip_values(X, lower, upper):
    return np.clip(X, lower, upper)

clipper = FunctionTransformer(
    clip_values,
    kw_args={'lower': 0, 'upper': 100}
)

# Lambda functions (use with caution: not pickleable!)
# square = FunctionTransformer(lambda x: x ** 2)  # Works locally
# joblib.dump(square, 'model.pkl')                # FAILS: lambda can't be pickled

# Solution: use named functions or functools.partial
from functools import partial

def power(X, exponent):
    return X ** exponent

square = FunctionTransformer(partial(power, exponent=2))
# This IS pickleable

# Complex example: chained in pipeline
preprocessing = Pipeline([
    ('log', FunctionTransformer(np.log1p, validate=True)),
    ('clip', FunctionTransformer(
        lambda X: np.clip(X, 0, 10),
        feature_names_out='one-to-one'  # Preserve feature names
    )),
])

# Validation mode
validated_transformer = FunctionTransformer(
    np.exp,
    accept_sparse=False,   # Don't accept sparse matrices
    check_inverse=False,
    validate=True          # Run sklearn validation
)
```

FunctionTransformer with lambda functions cannot be pickled/serialized. For production models, use named functions, `functools.partial`, or full custom transformer classes. This is a critical production deployment consideration.
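The pickling difference is easy to verify directly. As a quick sketch: a ufunc-backed transformer round-trips through `pickle`, while a lambda-backed one fails because lambdas have no importable name to pickle by reference.

```python
import pickle
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Named functions (including numpy ufuncs) pickle by reference,
# so the transformer round-trips cleanly.
good = FunctionTransformer(np.log1p)
restored = pickle.loads(pickle.dumps(good))
print(np.allclose(restored.transform(np.array([[0.0, 1.0]])),
                  np.log1p(np.array([[0.0, 1.0]]))))  # True

# A lambda cannot be pickled by reference; dumps raises
# (PicklingError or AttributeError depending on context).
bad = FunctionTransformer(lambda x: x ** 2)
try:
    pickle.dumps(bad)
    lambda_pickled = True
except Exception as e:
    lambda_pickled = False
    print(type(e).__name__)
```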
Most real-world custom transformers are stateful. Building them correctly requires careful attention to the estimator contract, validation, and edge cases. Let's build a production-grade example:
```python
from sklearn.base import BaseEstimator, TransformerMixin, OneToOneFeatureMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np
import warnings


class RobustOutlierClipper(BaseEstimator, TransformerMixin, OneToOneFeatureMixin):
    """
    Clip outliers using IQR-based bounds learned during fit.

    This transformer learns clipping bounds from training data using the
    Interquartile Range (IQR) method, commonly used in robust statistics.

    Parameters
    ----------
    iqr_multiplier : float, default=1.5
        Multiplier for IQR to determine outlier bounds.
        1.5 gives standard "outlier" detection; 3.0 gives "extreme outlier".
    per_feature : bool, default=True
        If True, learn separate bounds per feature.
        If False, learn global bounds across all features.

    Attributes
    ----------
    lower_bounds_ : ndarray of shape (n_features,)
        Lower clipping bounds learned during fit.
    upper_bounds_ : ndarray of shape (n_features,)
        Upper clipping bounds learned during fit.
    n_features_in_ : int
        Number of features seen during fit.
    n_clipped_ : dict
        Count of values clipped during last transform (for monitoring).

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[1, 100], [2, 200], [3, 300], [4, 10000]])
    >>> clipper = RobustOutlierClipper(iqr_multiplier=1.5)
    >>> clipper.fit_transform(X)
    """

    # Declarative parameter validation (used by sklearn >= 1.2)
    _parameter_constraints = {
        'iqr_multiplier': [float, int],
        'per_feature': [bool],
    }

    def __init__(self, iqr_multiplier=1.5, per_feature=True):
        self.iqr_multiplier = iqr_multiplier
        self.per_feature = per_feature

    def fit(self, X, y=None):
        """
        Learn clipping bounds from training data.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data from which to learn bounds.
        y : Ignored
            Not used, present for API consistency.

        Returns
        -------
        self : RobustOutlierClipper
            Fitted transformer.
        """
        # Use sklearn's validation (handles DataFrame, sparse, etc.)
        X = self._validate_data(
            X,
            reset=True,
            dtype=np.float64,             # Ensure numeric
            force_all_finite='allow-nan'  # Allow NaN; we'll handle it
        )

        # Validate parameters
        if self.iqr_multiplier <= 0:
            raise ValueError(
                f"iqr_multiplier must be positive, got {self.iqr_multiplier}"
            )

        # Handle NaN in statistics computation
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            if self.per_feature:
                q1 = np.nanpercentile(X, 25, axis=0)
                q3 = np.nanpercentile(X, 75, axis=0)
            else:
                q1 = np.nanpercentile(X, 25)
                q3 = np.nanpercentile(X, 75)
                q1 = np.full(self.n_features_in_, q1)
                q3 = np.full(self.n_features_in_, q3)

        iqr = q3 - q1

        # Learned parameters (trailing underscore convention)
        self.lower_bounds_ = q1 - self.iqr_multiplier * iqr
        self.upper_bounds_ = q3 + self.iqr_multiplier * iqr

        # Handle edge case: zero IQR (constant column)
        zero_iqr_mask = iqr == 0
        if np.any(zero_iqr_mask):
            # For constant columns, don't clip (bounds = ±inf)
            self.lower_bounds_[zero_iqr_mask] = -np.inf
            self.upper_bounds_[zero_iqr_mask] = np.inf
            warnings.warn(
                f"Columns {np.where(zero_iqr_mask)[0]} have zero IQR; "
                "clipping disabled for these features.",
                UserWarning
            )

        return self

    def transform(self, X):
        """
        Clip outliers using bounds learned during fit.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to transform.

        Returns
        -------
        X_clipped : ndarray of shape (n_samples, n_features)
            Transformed data with outliers clipped.
        """
        # Verify fit has been called
        check_is_fitted(self, ['lower_bounds_', 'upper_bounds_'])

        # Validate and convert input
        X = self._validate_data(
            X,
            reset=False,  # Don't reset n_features_in_
            dtype=np.float64,
            force_all_finite='allow-nan'
        )

        # Clip to learned bounds
        X_clipped = np.clip(X, self.lower_bounds_, self.upper_bounds_)

        # Track clipping statistics (useful for monitoring)
        self.n_clipped_ = {
            'lower': int(np.sum(X < self.lower_bounds_)),
            'upper': int(np.sum(X > self.upper_bounds_)),
            'total': int(np.sum((X < self.lower_bounds_) |
                                (X > self.upper_bounds_)))
        }

        return X_clipped

    def get_feature_names_out(self, input_features=None):
        """Return feature names (unchanged by clipping)."""
        check_is_fitted(self)
        # OneToOneFeatureMixin already implements the one-to-one mapping
        return super().get_feature_names_out(input_features)


# Usage with edge cases
X_train = np.array([
    [1.0,   100,   5.0],
    [2.0,   200,   5.0],  # Third column is constant
    [3.0,   300,   5.0],
    [4.0,   400,   5.0],
    [100.0, 10000, 5.0]   # Outlier in first two columns
])

clipper = RobustOutlierClipper(iqr_multiplier=1.5)
X_transformed = clipper.fit_transform(X_train)

print(f"Bounds: lower={clipper.lower_bounds_}, upper={clipper.upper_bounds_}")
print(f"Clipping stats: {clipper.n_clipped_}")
```

High-quality transformers include: (1) complete docstrings with parameters, attributes, and examples; (2) parameter validation in `fit()`; (3) `check_is_fitted()` in `transform()`; (4) proper handling of NaN and edge cases; (5) monitoring attributes like `n_clipped_` for observability.
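Point (3) in that checklist is worth seeing in action: `check_is_fitted()` inside `transform()` is what turns "used before fit" into a clear `NotFittedError` instead of a confusing `AttributeError` about missing learned attributes. Built-in transformers behave the same way:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Calling transform() before fit() raises a clear, specific error
# because sklearn transformers call check_is_fitted() internally.
try:
    StandardScaler().transform(np.array([[1.0, 2.0]]))
    raised = False
except NotFittedError as e:
    raised = True
    print(f"NotFittedError: {e}")
```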
Modern ML workflows often use pandas DataFrames. While scikit-learn internally converts to numpy arrays, we can build transformers that preserve DataFrame semantics and leverage pandas functionality:
```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np


class DatetimeFeatureExtractor(BaseEstimator, TransformerMixin):
    """
    Extract datetime components from datetime columns.

    This transformer works with pandas DataFrames and extracts useful
    features like hour, day of week, month, etc.

    Parameters
    ----------
    datetime_columns : list of str
        Column names containing datetime data
    features : list of str, default=None
        Features to extract; None means ['hour', 'dayofweek', 'month', 'year'].
        Options: 'hour', 'minute', 'second', 'dayofweek', 'day',
        'month', 'year', 'quarter', 'is_weekend'
    drop_original : bool, default=True
        Whether to drop the original datetime columns
    """

    _DEFAULT_FEATURES = ['hour', 'dayofweek', 'month', 'year']

    def __init__(self, datetime_columns, features=None, drop_original=True):
        self.datetime_columns = datetime_columns
        # Store the parameter exactly as given; resolving the None default
        # here would break cloning (get_params must round-trip __init__)
        self.features = features
        self.drop_original = drop_original

    def _resolve_features(self):
        return (self.features if self.features is not None
                else self._DEFAULT_FEATURES)

    def fit(self, X, y=None):
        # Validate input is DataFrame
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                f"DatetimeFeatureExtractor requires pandas DataFrame, "
                f"got {type(X).__name__}"
            )

        # Validate columns exist
        missing = set(self.datetime_columns) - set(X.columns)
        if missing:
            raise ValueError(f"Columns not found in DataFrame: {missing}")

        # Store column order for consistent output
        self.feature_names_in_ = np.array(X.columns.tolist())
        self.n_features_in_ = len(self.feature_names_in_)
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                f"DatetimeFeatureExtractor requires pandas DataFrame, "
                f"got {type(X).__name__}"
            )

        X = X.copy()
        features = self._resolve_features()

        for col in self.datetime_columns:
            # Convert to datetime if needed
            dt = pd.to_datetime(X[col])

            # Extract requested features
            for feature in features:
                new_col_name = f"{col}_{feature}"
                if feature == 'hour':
                    X[new_col_name] = dt.dt.hour
                elif feature == 'minute':
                    X[new_col_name] = dt.dt.minute
                elif feature == 'second':
                    X[new_col_name] = dt.dt.second
                elif feature == 'dayofweek':
                    X[new_col_name] = dt.dt.dayofweek
                elif feature == 'day':
                    X[new_col_name] = dt.dt.day
                elif feature == 'month':
                    X[new_col_name] = dt.dt.month
                elif feature == 'year':
                    X[new_col_name] = dt.dt.year
                elif feature == 'quarter':
                    X[new_col_name] = dt.dt.quarter
                elif feature == 'is_weekend':
                    X[new_col_name] = (dt.dt.dayofweek >= 5).astype(int)

            if self.drop_original:
                X = X.drop(columns=[col])

        return X

    def get_feature_names_out(self, input_features=None):
        """Return output feature names, matching transform's column order."""
        features = self._resolve_features()
        # transform() appends new columns at the end, so surviving
        # original columns come first, in their original order
        output_cols = [
            col for col in self.feature_names_in_
            if col not in self.datetime_columns or not self.drop_original
        ]
        for col in self.datetime_columns:
            for feature in features:
                output_cols.append(f"{col}_{feature}")
        return np.array(output_cols)


# DataFrameWrapper: wrap any sklearn transformer to preserve DataFrame output
class DataFrameWrapper(BaseEstimator, TransformerMixin):
    """
    Wrap an sklearn transformer to preserve DataFrame output.

    Useful when you want numpy-based transformers to return
    DataFrames with proper column names.
    """

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        if hasattr(self.transformer, 'feature_names_in_'):
            self.feature_names_in_ = self.transformer.feature_names_in_
        return self

    def transform(self, X):
        X_transformed = self.transformer.transform(X)

        # Get feature names from transformer if available
        if hasattr(self.transformer, 'get_feature_names_out'):
            columns = self.transformer.get_feature_names_out()
        elif hasattr(X, 'columns'):
            columns = X.columns
        else:
            columns = [f'feature_{i}' for i in range(X_transformed.shape[1])]

        return pd.DataFrame(X_transformed, columns=columns,
                            index=getattr(X, 'index', None))


# Usage
df = pd.DataFrame({
    'signup_date': ['2023-01-15 10:30:00', '2023-06-20 14:45:00'],
    'amount': [100.0, 250.0]
})

extractor = DatetimeFeatureExtractor(
    datetime_columns=['signup_date'],
    features=['hour', 'dayofweek', 'month', 'is_weekend']
)

df_transformed = extractor.fit_transform(df)
print(df_transformed.columns.tolist())
# ['amount', 'signup_date_hour', 'signup_date_dayofweek',
#  'signup_date_month', 'signup_date_is_weekend']
```

Scikit-learn 1.2+ includes a set_output API that makes transformers return DataFrames automatically. Use `transformer.set_output(transform='pandas')` to enable it. This reduces the need for custom DataFrame-preserving transformers in newer sklearn versions.
Some transformers benefit from an inverse_transform method that reverses the transformation. This is essential for interpretability—converting model outputs back to the original feature scale:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """
    Apply Box-Cox transformation with learned optimal lambda.

    Supports inverse_transform for converting back to original scale.

    Parameters
    ----------
    lmbda : float or 'auto', default='auto'
        Box-Cox parameter. If 'auto', learns optimal lambda from data.
    shift : float, default=0
        Value added to ensure all values are positive: y = x + shift

    Attributes
    ----------
    lmbda_ : float
        The actual lambda used (learned if lmbda='auto')
    shift_ : float
        The actual shift used (may be computed if data has
        non-positive values)
    """

    def __init__(self, lmbda='auto', shift=0):
        self.lmbda = lmbda
        self.shift = shift

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)

        # Ensure data is positive
        min_val = np.min(X)
        if min_val <= 0:
            self.shift_ = -min_val + 1e-6
        else:
            self.shift_ = self.shift

        X_shifted = X + self.shift_

        # Learn optimal lambda or use provided
        if self.lmbda == 'auto':
            from scipy import stats
            # Learn lambda that maximizes log-likelihood
            _, self.lmbda_ = stats.boxcox(X_shifted.flatten())
        else:
            self.lmbda_ = float(self.lmbda)

        return self

    def transform(self, X):
        check_is_fitted(self, ['lmbda_', 'shift_'])
        X = self._validate_data(X, reset=False)
        X_shifted = X + self.shift_

        # Box-Cox transformation
        if np.abs(self.lmbda_) < 1e-10:
            # Lambda ≈ 0: use log transformation
            return np.log(X_shifted)
        else:
            return (np.power(X_shifted, self.lmbda_) - 1) / self.lmbda_

    def inverse_transform(self, X_transformed):
        """
        Reverse the Box-Cox transformation.

        Parameters
        ----------
        X_transformed : array-like of shape (n_samples, n_features)
            Data in transformed space.

        Returns
        -------
        X : ndarray of shape (n_samples, n_features)
            Data in original space.
        """
        check_is_fitted(self, ['lmbda_', 'shift_'])
        X_transformed = np.asarray(X_transformed)

        # Inverse Box-Cox
        if np.abs(self.lmbda_) < 1e-10:
            X_shifted = np.exp(X_transformed)
        else:
            X_shifted = np.power(X_transformed * self.lmbda_ + 1,
                                 1 / self.lmbda_)

        # Remove shift
        return X_shifted - self.shift_


# Usage: Transform and inverse transform
np.random.seed(42)
X = np.random.exponential(scale=2.0, size=(100, 2))  # Right-skewed data

transformer = BoxCoxTransformer(lmbda='auto')
X_transformed = transformer.fit_transform(X)
X_reconstructed = transformer.inverse_transform(X_transformed)

print(f"Learned lambda: {transformer.lmbda_:.4f}")
print(f"Original mean: {X.mean():.4f}")
print(f"Transformed mean: {X_transformed.mean():.4f}")
print(f"Reconstruction error: {np.abs(X - X_reconstructed).max():.2e}")

# Practical use: Inverse transform predictions
from sklearn.linear_model import LinearRegression

# Transform target variable
y = np.random.exponential(scale=100, size=100) + 50
y_transformer = BoxCoxTransformer(lmbda='auto')
y_transformed = y_transformer.fit_transform(y.reshape(-1, 1)).ravel()

# Train on transformed target
model = LinearRegression()
model.fit(X, y_transformed)

# Predict in transformed space
y_pred_transformed = model.predict(X[:5])

# Convert predictions back to original scale
y_pred_original = y_transformer.inverse_transform(
    y_pred_transformed.reshape(-1, 1))
print(f"Predictions in original scale: {y_pred_original.ravel()}")
```

Implement `inverse_transform` when: (1) you're transforming targets and need predictions in original units, (2) you need to explain feature effects in the original scale, (3) the transformation is mathematically invertible, (4) you're building pipelines that support full round-trip transformation.
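The manual transform-train-inverse pattern for targets can also be delegated to scikit-learn's built-in `TransformedTargetRegressor`, which fits the regressor on `func(y)` and applies `inverse_func` to predictions automatically. A minimal sketch (synthetic data, names illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=100))  # positive, right-skewed target

# Fits LinearRegression on log(y); predict() applies exp() so the
# outputs are already back in the original target scale.
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
ttr.fit(X, y)
preds = ttr.predict(X[:3])
print(preds)  # predictions in original (positive) scale
```

This avoids hand-managing a separate `y_transformer` object and keeps the round-trip inside a single estimator that works with `cross_val_score` and grid search.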
Scikit-learn's cross-validation and grid search rely on clone() to create fresh copies of estimators for each fold. Custom transformers must be "clonable" to work correctly. This requires strict adherence to the estimator contract:
```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
import numpy as np


# BROKEN: Won't clone correctly
class BrokenTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.5, **kwargs):
        self.threshold = threshold
        self.extra_stuff = kwargs               # BAD: arbitrary kwargs break cloning
        self._precomputed = self.threshold * 2  # BAD: computation in __init__

    def fit(self, X, y=None):
        return self


# get_params() cannot introspect **kwargs, so cloning silently
# drops them: the clone is missing random_kwarg entirely
broken = BrokenTransformer(threshold=0.7, random_kwarg=42)
cloned = clone(broken)
print(broken.extra_stuff)  # {'random_kwarg': 42}
print(cloned.extra_stuff)  # {} — silently lost in the clone


# CORRECT: Follows cloning requirements
class CorrectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.5):
        # Rule 1: Parameters stored with EXACT same name as argument
        self.threshold = threshold
        # Rule 2: NO computation, NO derived values

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # Computed values go here, with trailing underscore
        self.threshold_value_ = self.threshold * np.std(X)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        return np.where(X > self.threshold_value_, X, 0)


# Cloning works
correct = CorrectTransformer(threshold=0.7)
correct.fit(np.array([[1, 2], [3, 4]]))
print(f"Original threshold_value_: {correct.threshold_value_}")

cloned = clone(correct)
print(f"Cloned has threshold: {cloned.threshold}")
print(f"Cloned is fitted: {hasattr(cloned, 'threshold_value_')}")  # False - unfitted!


# Verifying clone behavior
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('transformer', CorrectTransformer(threshold=0.5)),
    ('classifier', LogisticRegression())
])

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# cross_val_score clones the pipeline for each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV scores: {scores}")  # Works because transformer is clonable


# Testing clonability
def test_clonable(estimator):
    """Test that an estimator can be cloned correctly."""
    try:
        cloned = clone(estimator)

        # Check parameters match
        orig_params = estimator.get_params()
        clone_params = cloned.get_params()
        for key in orig_params:
            if orig_params[key] != clone_params[key]:
                print(f"Parameter mismatch: {key}")
                return False

        # Check clone is unfitted
        fitted_attrs = [a for a in dir(cloned)
                        if a.endswith('_') and not a.startswith('_')]
        if fitted_attrs:
            print(f"Clone appears fitted: {fitted_attrs}")
            return False

        print("Estimator is clonable ✓")
        return True
    except Exception as e:
        print(f"Clone failed: {e}")
        return False


test_clonable(CorrectTransformer(threshold=0.5))
```

The cloning requirements, summarized:

- `__init__` parameters stored as attributes with identical names — `__init__(self, x=1)` must store `self.x = x`
- No computation in `__init__` — derived values computed there break `clone`'s assumption that `__init__(**params)` creates an equivalent unfitted estimator
- No `*args` or `**kwargs` — these cannot be introspected by `get_params()` for cloning
- Learned state only in `fit()`, with trailing underscores — `self.mean_`, `self.coef_`, etc.; `clone` creates an unfitted copy without them

Never do `self.derived_param = some_function(self.param)` in `__init__`. `clone` recreates the estimator by calling `__init__(**get_params())` and expects an unfitted estimator. Derived values must be computed in `fit()` and stored with a trailing underscore.
Robust custom transformers require thorough testing. Scikit-learn provides utilities to verify estimator compliance, and you should add domain-specific tests:
```python
import numpy as np
import pytest
from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks
from sklearn.base import BaseEstimator, TransformerMixin, clone


# The transformer under test — a minimal working stand-in; substitute
# your own custom transformer here
class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param=1.0):
        self.param = param

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        return X * self.param


# Method 1: Run all sklearn checks (comprehensive but slow)
def test_sklearn_compliance():
    """Run sklearn's estimator checks."""
    transformer = MyTransformer()
    # This runs ~30 checks for transformers
    for estimator, check in check_estimator(transformer, generate_only=True):
        try:
            check(estimator)
        except Exception as e:
            print(f"Check {check.func.__name__} failed: {e}")


# Method 2: Use pytest parametrize (recommended for CI)
@parametrize_with_checks([MyTransformer()])
def test_sklearn_compatible(estimator, check):
    check(estimator)


# Method 3: Manual essential tests (faster, targeted)
class TestMyTransformer:

    @pytest.fixture
    def sample_data(self):
        np.random.seed(42)
        return np.random.randn(100, 5)

    @pytest.fixture
    def transformer(self):
        return MyTransformer(param=1.0)

    def test_fit_returns_self(self, transformer, sample_data):
        """Verify fit() returns self for method chaining."""
        result = transformer.fit(sample_data)
        assert result is transformer

    def test_transform_shape(self, transformer, sample_data):
        """Verify transform preserves number of samples."""
        transformer.fit(sample_data)
        transformed = transformer.transform(sample_data)
        assert transformed.shape[0] == sample_data.shape[0]

    def test_fit_transform_equals_fit_then_transform(self, transformer,
                                                     sample_data):
        """Verify fit_transform consistency."""
        t1 = clone(transformer)
        t2 = clone(transformer)
        result1 = t1.fit_transform(sample_data)
        result2 = t2.fit(sample_data).transform(sample_data)
        np.testing.assert_array_almost_equal(result1, result2)

    def test_transform_without_fit_raises(self, transformer, sample_data):
        """Verify transform before fit raises error."""
        with pytest.raises(Exception):  # NotFittedError or similar
            transformer.transform(sample_data)

    def test_clone_produces_unfitted_copy(self, transformer, sample_data):
        """Verify clone creates unfitted estimator."""
        transformer.fit(sample_data)
        cloned = clone(transformer)
        # Clone should not be fitted
        assert not hasattr(cloned, 'n_features_in_') or \
            getattr(cloned, 'n_features_in_', None) is None

    def test_get_set_params(self, transformer):
        """Verify parameter introspection."""
        params = transformer.get_params()
        assert 'param' in params

        new_transformer = transformer.set_params(param=2.0)
        assert new_transformer.param == 2.0

    def test_handles_pandas_dataframe(self, transformer):
        """Verify DataFrame compatibility."""
        import pandas as pd
        df = pd.DataFrame(np.random.randn(50, 3), columns=['a', 'b', 'c'])
        transformer.fit(df)
        result = transformer.transform(df)  # Should not raise

    def test_handles_nan(self, transformer, sample_data):
        """Verify NaN handling (if applicable)."""
        data_with_nan = sample_data.copy()
        data_with_nan[0, 0] = np.nan

        # Either handles gracefully or raises an informative error
        try:
            transformer.fit(data_with_nan)
        except ValueError as e:
            assert 'nan' in str(e).lower() or 'missing' in str(e).lower()

    def test_pickling(self, transformer, sample_data):
        """Verify transformer can be serialized."""
        import pickle
        transformer.fit(sample_data)
        pickled = pickle.dumps(transformer)
        unpickled = pickle.loads(pickled)
        np.testing.assert_array_almost_equal(
            transformer.transform(sample_data),
            unpickled.transform(sample_data)
        )
```

Start with `check_estimator()` to catch contract violations. Then add domain-specific tests for your transformer's unique behavior. Include edge cases: empty data, single sample, single feature, all-NaN column, constant columns, mixed dtypes. Test serialization if used in production.
Custom transformers unlock the ability to incorporate arbitrary domain logic into scikit-learn workflows. Let's consolidate the key insights:
- The transformer contract: implement fit() and transform(), return self from fit, and store parameters exactly as named in __init__.
- DataFrame workflows: use the set_output API or custom wrappers to preserve DataFrame semantics.
- Testing: check_estimator() plus custom edge case tests.

What's Next:
Once you've built robust transformation pipelines, you need to save them for production use. The next page covers Pipeline Serialization—how to persist fitted pipelines, manage versioning, handle compatibility, and ensure reliable deployment.
You now have the knowledge to build production-grade custom transformers that integrate with scikit-learn's ecosystem. Next, we'll learn how to serialize these transformers and pipelines for deployment and reproducibility.