Scikit-learn provides an impressive arsenal of built-in transformers: scalers, encoders, imputers, feature selectors, and more. Yet real-world machine learning inevitably encounters transformations that no library anticipated.
Consider these scenarios (all developed later on this page):

- Clipping outliers using IQR bounds learned from the training data
- Extracting hour, day-of-week, and month features from datetime columns
- Applying a Box-Cox transformation whose optimal lambda is learned during fitting

None of these fit neatly into StandardScaler or OneHotEncoder. You need the ability to create custom transformers that encapsulate your domain logic while integrating seamlessly with Pipelines and ColumnTransformers.
By the end of this page, you will understand the scikit-learn transformer contract at a deep level and be able to implement robust custom transformers for any use case. You'll learn the difference between stateful and stateless transformers, how to ensure compatibility with cross-validation and grid search, and production-grade implementation patterns.
Before implementing custom transformers, we must deeply understand the contract they must satisfy. A scikit-learn transformer is any object that implements specific methods with specific signatures and behaviors.
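The contract is duck-typed: any object with the right methods works. As a minimal sketch (the `IdentityTransformer` name is illustrative), a bare class implementing `fit` and `transform` already runs inside a Pipeline, though without `BaseEstimator` it won't support `clone()` or grid search:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# A bare class satisfying the contract: fit() returns self,
# transform() returns transformed data. No inheritance required for
# basic Pipeline use (BaseEstimator/TransformerMixin add get_params
# and fit_transform, needed for clone/GridSearchCV).
class IdentityTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)

pipe = Pipeline([('identity', IdentityTransformer()),
                 ('model', LinearRegression())])
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
pipe.fit(X, y)
print(pipe.predict([[4.0]]))  # ≈ [8.]
```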
Required Methods:
| Method | Signature | Behavior |
|---|---|---|
| `fit` | `fit(X, y=None)` | Learn parameters from data. Returns `self`. |
| `transform` | `transform(X)` | Apply the transformation. Returns the transformed data. |
Optional But Expected:
| Method | Signature | Behavior |
|---|---|---|
| `fit_transform` | `fit_transform(X, y=None)` | Convenience: fit then transform. May be optimized. |
| `get_params` | `get_params(deep=True)` | Return a dict of constructor parameters. |
| `set_params` | `set_params(**params)` | Set parameters. Returns `self`. |
| `get_feature_names_out` | `get_feature_names_out(input_features=None)` | Return output feature names. |
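The `get_params`/`set_params` pair is exactly what GridSearchCV uses to try hyperparameter combinations. A quick check with a built-in transformer shows the round-trip:

```python
from sklearn.preprocessing import StandardScaler

# get_params returns the constructor arguments as a dict;
# set_params mutates them and returns self (enabling chaining).
scaler = StandardScaler()
print(scaler.get_params())
# {'copy': True, 'with_mean': True, 'with_std': True}

scaler.set_params(with_mean=False)
print(scaler.get_params()['with_mean'])  # False
```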
```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class LogTransformer(BaseEstimator, TransformerMixin):
    """
    Apply log transformation to numerical features.

    Demonstrates the minimal transformer contract:
    - Inherits from BaseEstimator (get_params, set_params)
    - Inherits from TransformerMixin (fit_transform default implementation)
    - Implements fit() and transform()

    Parameters
    ----------
    offset : float, default=1.0
        Value added before log to handle zeros: log(x + offset)
    base : {'natural', '10', '2'}, default='natural'
        Logarithm base to use

    Attributes
    ----------
    n_features_in_ : int
        Number of features seen during fit
    feature_names_in_ : ndarray of shape (n_features_in_,)
        Names of features seen during fit (if available)
    """

    def __init__(self, offset=1.0, base='natural'):
        # Rule 1: Store constructor parameters as attributes with SAME NAME
        self.offset = offset
        self.base = base
        # Rule 2: NO computation in __init__. Only parameter storage.

    def fit(self, X, y=None):
        """
        Fit: learn any necessary parameters from training data.

        For log transform, we don't learn anything data-dependent,
        but we still validate and record metadata.
        """
        # Validate input using sklearn's validation utilities
        X = self._validate_data(X, reset=True)
        # _validate_data automatically sets:
        # - self.n_features_in_ (number of input features)
        # - self.feature_names_in_ (if X has feature names, e.g., DataFrame)

        # Validate parameter values
        if self.offset <= 0:
            raise ValueError(f"offset must be positive, got {self.offset}")
        if self.base not in {'natural', '10', '2'}:
            raise ValueError(
                f"base must be 'natural', '10', or '2', got {self.base}"
            )

        # Rule 3: MUST return self for method chaining
        return self

    def transform(self, X):
        """Transform: apply the log transformation."""
        # Validate against fitted state
        X = self._validate_data(X, reset=False)  # reset=False checks against fit

        # Rule 4: Don't modify input; work on a copy
        X_transformed = X + self.offset

        if self.base == 'natural':
            X_transformed = np.log(X_transformed)
        elif self.base == '10':
            X_transformed = np.log10(X_transformed)
        elif self.base == '2':
            X_transformed = np.log2(X_transformed)

        return X_transformed

    def get_feature_names_out(self, input_features=None):
        """Return feature names for output features."""
        # Use sklearn's utility for consistent behavior
        from sklearn.utils.validation import _check_feature_names_in
        input_features = _check_feature_names_in(self, input_features)
        return np.asarray([f"log_{feat}" for feat in input_features])


# Usage
transformer = LogTransformer(offset=1.0, base='natural')
X = np.array([[1, 10], [2, 20], [3, 30]])
X_log = transformer.fit_transform(X)
print(X_log)
```

Forgetting `return self` in `fit()` is the single most common custom transformer bug. Without it, `fit()` returns `None`, so chained calls such as `fit(X).transform(X)` (which is exactly what Pipelines do internally) fail with cryptic AttributeErrors. Always end `fit()` with `return self`.
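To see the failure mode concretely, here is a small sketch (the `ForgotReturnSelf` name is illustrative) of a transformer whose `fit()` forgets to return `self`, breaking `Pipeline.fit`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class ForgotReturnSelf(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        pass  # BUG: implicitly returns None instead of self

    def transform(self, X):
        return np.asarray(X)

pipe = Pipeline([('broken', ForgotReturnSelf()),
                 ('model', LinearRegression())])
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# TransformerMixin.fit_transform does self.fit(X, y).transform(X);
# with fit() returning None, this chains off None and raises
# (an AttributeError in current sklearn versions).
try:
    pipe.fit(X, y)
    error_name = None
except Exception as e:
    error_name = type(e).__name__

print(error_name)
```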
Transformers fall into two fundamental categories based on whether they learn from data:
Stateless Transformers: Apply the same transformation regardless of training data. The transformation is defined entirely by constructor parameters. Examples: LogTransformer, PowerTransformer with a fixed exponent, FunctionTransformer.

Stateful Transformers: Learn transformation parameters from training data. The transformation depends on what was seen during fit(). Examples: StandardScaler (learns mean, std), OneHotEncoder (learns categories), PCA (learns components).

This distinction has profound implications for cross-validation and data leakage.
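The statefulness is easy to demonstrate with a built-in transformer: the same test point transforms differently depending on what the scaler saw during `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same input transforms differently depending on the training
# data — the signature of a STATEFUL transformer.
X_test = np.array([[10.0]])

scaler_a = StandardScaler().fit(np.array([[0.0], [20.0]]))   # mean=10, std=10
scaler_b = StandardScaler().fit(np.array([[0.0], [100.0]]))  # mean=50, std=50

print(scaler_a.transform(X_test))  # [[0.]]
print(scaler_b.transform(X_test))  # [[-0.8]]
```

This is precisely why stateful parameters must be learned on training folds only: if test data influences them, you have leakage.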
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
import numpy as np


# Stateless Option 1: FunctionTransformer for simple cases
def winsorize(X, lower=0.05, upper=0.95):
    """Clip values to percentile range."""
    lower_bound = np.percentile(X, lower * 100, axis=0)
    upper_bound = np.percentile(X, upper * 100, axis=0)
    return np.clip(X, lower_bound, upper_bound)


# WARNING: This looks stateless but has a subtle bug!
# Percentiles are computed on transform-time data, not fit-time data.
# This causes train-test leakage: test data influences its own transformation.
winsorizer = FunctionTransformer(winsorize)


# Correct stateful implementation:
class WinsorizeTransformer(BaseEstimator, TransformerMixin):
    """
    Winsorize features by clipping to percentile bounds learned during fit.

    This is STATEFUL: percentile bounds are learned from training data
    and applied identically to all future data.
    """

    def __init__(self, lower_percentile=5, upper_percentile=95):
        self.lower_percentile = lower_percentile
        self.upper_percentile = upper_percentile

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # LEARN bounds from training data (suffix _ indicates learned)
        self.lower_bounds_ = np.percentile(X, self.lower_percentile, axis=0)
        self.upper_bounds_ = np.percentile(X, self.upper_percentile, axis=0)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        # APPLY learned bounds (works correctly even if test percentiles differ)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


# True stateless transformer (transformation is fully parameterized)
class PolynomialExpansion(BaseEstimator, TransformerMixin):
    """
    Expand features with polynomial terms.

    This is STATELESS: the transformation is defined purely by
    parameters, not by any learning from data.
    """

    def __init__(self, degree=2, include_bias=False):
        self.degree = degree
        self.include_bias = include_bias

    def fit(self, X, y=None):
        # Stateless: fit only validates and records metadata
        X = self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        # Pure function of input and parameters, no learned state
        from sklearn.preprocessing import PolynomialFeatures
        poly = PolynomialFeatures(
            degree=self.degree,
            include_bias=self.include_bias
        )
        return poly.fit_transform(X)
```

Ask yourself: "If I transform new data, should the result depend on what I saw during training?" If yes, your transformer is stateful and MUST learn parameters during `fit()`. If transforming `[1, 2, 3]` gives different results depending on whether you trained on `[1, 100]` vs `[1, 1000]`, you have a stateful transformer.
For truly stateless transformations, scikit-learn provides FunctionTransformer as a convenient wrapper. It converts any function into a Pipeline-compatible transformer without writing a class:
```python
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

# Basic usage: wrap any function
log_transformer = FunctionTransformer(np.log1p)

# With inverse for interpretability
sqrt_transformer = FunctionTransformer(
    func=np.sqrt,
    inverse_func=np.square,  # Enables inverse_transform
    validate=True,           # Validate input
    check_inverse=True       # Verify inverse_func(func(X)) ≈ X
)

# With feature names
def extract_hour(X):
    """Extract hour from datetime column."""
    return X.apply(lambda col: pd.to_datetime(col).dt.hour
                   if col.dtype == 'object' else col)

hour_extractor = FunctionTransformer(
    extract_hour,
    feature_names_out=lambda self, names: [f"{n}_hour" for n in names]
)

# Passing additional parameters via kw_args
def clip_values(X, lower, upper):
    return np.clip(X, lower, upper)

clipper = FunctionTransformer(
    clip_values,
    kw_args={'lower': 0, 'upper': 100}
)

# Lambda functions (use with caution: not pickleable!)
# square = FunctionTransformer(lambda x: x ** 2)  # Works locally
# joblib.dump(square, 'model.pkl')                # FAILS: lambda can't be pickled

# Solution: use named functions or functools.partial
from functools import partial

def power(X, exponent):
    return X ** exponent

square = FunctionTransformer(partial(power, exponent=2))
# This IS pickleable

# Complex example: chained in pipeline
preprocessing = Pipeline([
    ('log', FunctionTransformer(np.log1p, validate=True)),
    ('clip', FunctionTransformer(
        lambda X: np.clip(X, 0, 10),
        feature_names_out='one-to-one'  # Preserve feature names
    )),
])

# Validation mode
validated_transformer = FunctionTransformer(
    np.exp,
    accept_sparse=False,   # Don't accept sparse matrices
    check_inverse=False,
    validate=True          # Run sklearn validation
)
```

FunctionTransformer with lambda functions cannot be pickled/serialized. For production models, use named functions, `functools.partial`, or full custom transformer classes. This is a critical production deployment consideration.
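The pickling difference is easy to verify directly. As a quick sketch: a ufunc-backed transformer round-trips through `pickle`, while a lambda-backed one fails because lambdas have no importable name to pickle by reference.

```python
import pickle
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Named functions (including numpy ufuncs) pickle by reference,
# so the transformer round-trips cleanly.
good = FunctionTransformer(np.log1p)
restored = pickle.loads(pickle.dumps(good))
print(np.allclose(restored.transform(np.array([[0.0, 1.0]])),
                  np.log1p(np.array([[0.0, 1.0]]))))  # True

# A lambda cannot be pickled by reference; dumps raises
# (PicklingError or AttributeError depending on context).
bad = FunctionTransformer(lambda x: x ** 2)
try:
    pickle.dumps(bad)
    lambda_pickled = True
except Exception as e:
    lambda_pickled = False
    print(type(e).__name__)
```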
Most real-world custom transformers are stateful. Building them correctly requires careful attention to the estimator contract, validation, and edge cases. Let's build a production-grade example:
```python
from sklearn.base import BaseEstimator, TransformerMixin, OneToOneFeatureMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np
import warnings


class RobustOutlierClipper(BaseEstimator, TransformerMixin, OneToOneFeatureMixin):
    """
    Clip outliers using IQR-based bounds learned during fit.

    This transformer learns clipping bounds from training data using the
    Interquartile Range (IQR) method, commonly used in robust statistics.

    Parameters
    ----------
    iqr_multiplier : float, default=1.5
        Multiplier for IQR to determine outlier bounds.
        1.5 gives standard "outlier" detection; 3.0 gives "extreme outlier".
    per_feature : bool, default=True
        If True, learn separate bounds per feature.
        If False, learn global bounds across all features.

    Attributes
    ----------
    lower_bounds_ : ndarray of shape (n_features,)
        Lower clipping bounds learned during fit.
    upper_bounds_ : ndarray of shape (n_features,)
        Upper clipping bounds learned during fit.
    n_features_in_ : int
        Number of features seen during fit.
    n_clipped_ : dict
        Count of values clipped during last transform (for monitoring).

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[1, 100], [2, 200], [3, 300], [4, 10000]])
    >>> clipper = RobustOutlierClipper(iqr_multiplier=1.5)
    >>> clipper.fit_transform(X)
    """

    # Declarative parameter validation (used by sklearn >= 1.2)
    _parameter_constraints = {
        'iqr_multiplier': [float, int],
        'per_feature': [bool],
    }

    def __init__(self, iqr_multiplier=1.5, per_feature=True):
        self.iqr_multiplier = iqr_multiplier
        self.per_feature = per_feature

    def fit(self, X, y=None):
        """
        Learn clipping bounds from training data.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data from which to learn bounds.
        y : Ignored
            Not used, present for API consistency.

        Returns
        -------
        self : RobustOutlierClipper
            Fitted transformer.
        """
        # Use sklearn's validation (handles DataFrame, sparse, etc.)
        X = self._validate_data(
            X,
            reset=True,
            dtype=np.float64,             # Ensure numeric
            force_all_finite='allow-nan'  # Allow NaN; we'll handle it
        )

        # Validate parameters
        if self.iqr_multiplier <= 0:
            raise ValueError(
                f"iqr_multiplier must be positive, got {self.iqr_multiplier}"
            )

        # Handle NaN in statistics computation
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            if self.per_feature:
                q1 = np.nanpercentile(X, 25, axis=0)
                q3 = np.nanpercentile(X, 75, axis=0)
            else:
                q1 = np.nanpercentile(X, 25)
                q3 = np.nanpercentile(X, 75)
                q1 = np.full(self.n_features_in_, q1)
                q3 = np.full(self.n_features_in_, q3)

        iqr = q3 - q1

        # Learned parameters (trailing underscore convention)
        self.lower_bounds_ = q1 - self.iqr_multiplier * iqr
        self.upper_bounds_ = q3 + self.iqr_multiplier * iqr

        # Handle edge case: zero IQR (constant column)
        zero_iqr_mask = iqr == 0
        if np.any(zero_iqr_mask):
            # For constant columns, don't clip (bounds = ±inf)
            self.lower_bounds_[zero_iqr_mask] = -np.inf
            self.upper_bounds_[zero_iqr_mask] = np.inf
            warnings.warn(
                f"Columns {np.where(zero_iqr_mask)[0]} have zero IQR; "
                "clipping disabled for these features.",
                UserWarning
            )

        return self

    def transform(self, X):
        """
        Clip outliers using bounds learned during fit.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to transform.

        Returns
        -------
        X_clipped : ndarray of shape (n_samples, n_features)
            Transformed data with outliers clipped.
        """
        # Verify fit has been called
        check_is_fitted(self, ['lower_bounds_', 'upper_bounds_'])

        # Validate and convert input
        X = self._validate_data(
            X,
            reset=False,  # Don't reset n_features_in_
            dtype=np.float64,
            force_all_finite='allow-nan'
        )

        # Clip to learned bounds
        X_clipped = np.clip(X, self.lower_bounds_, self.upper_bounds_)

        # Track clipping statistics (useful for monitoring)
        self.n_clipped_ = {
            'lower': int(np.sum(X < self.lower_bounds_)),
            'upper': int(np.sum(X > self.upper_bounds_)),
            'total': int(np.sum((X < self.lower_bounds_) |
                                (X > self.upper_bounds_)))
        }

        return X_clipped

    def get_feature_names_out(self, input_features=None):
        """Return feature names (unchanged by clipping)."""
        check_is_fitted(self)
        # OneToOneFeatureMixin already implements the one-to-one mapping
        return super().get_feature_names_out(input_features)


# Usage with edge cases
X_train = np.array([
    [1.0,   100,   5.0],
    [2.0,   200,   5.0],  # Third column is constant
    [3.0,   300,   5.0],
    [4.0,   400,   5.0],
    [100.0, 10000, 5.0]   # Outlier in first two columns
])

clipper = RobustOutlierClipper(iqr_multiplier=1.5)
X_transformed = clipper.fit_transform(X_train)

print(f"Bounds: lower={clipper.lower_bounds_}, upper={clipper.upper_bounds_}")
print(f"Clipping stats: {clipper.n_clipped_}")
```

High-quality transformers include: (1) complete docstrings with parameters, attributes, and examples; (2) parameter validation in `fit()`; (3) `check_is_fitted()` in `transform()`; (4) proper handling of NaN and edge cases; (5) monitoring attributes like `n_clipped_` for observability.
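Point (3) in that checklist is worth seeing in action: `check_is_fitted()` inside `transform()` is what turns "used before fit" into a clear `NotFittedError` instead of a confusing `AttributeError` about missing learned attributes. Built-in transformers behave the same way:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Calling transform() before fit() raises a clear, specific error
# because sklearn transformers call check_is_fitted() internally.
try:
    StandardScaler().transform(np.array([[1.0, 2.0]]))
    raised = False
except NotFittedError as e:
    raised = True
    print(f"NotFittedError: {e}")
```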
Modern ML workflows often use pandas DataFrames. While scikit-learn internally converts to numpy arrays, we can build transformers that preserve DataFrame semantics and leverage pandas functionality:
```python
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np


class DatetimeFeatureExtractor(BaseEstimator, TransformerMixin):
    """
    Extract datetime components from datetime columns.

    This transformer works with pandas DataFrames and extracts useful
    features like hour, day of week, month, etc.

    Parameters
    ----------
    datetime_columns : list of str
        Column names containing datetime data
    features : list of str, default=None
        Features to extract; None means ['hour', 'dayofweek', 'month', 'year'].
        Options: 'hour', 'minute', 'second', 'dayofweek', 'day',
        'month', 'year', 'quarter', 'is_weekend'
    drop_original : bool, default=True
        Whether to drop the original datetime columns
    """

    _DEFAULT_FEATURES = ['hour', 'dayofweek', 'month', 'year']

    def __init__(self, datetime_columns, features=None, drop_original=True):
        self.datetime_columns = datetime_columns
        # Store the parameter exactly as given; resolving the None default
        # here would break cloning (get_params must round-trip __init__)
        self.features = features
        self.drop_original = drop_original

    def _resolve_features(self):
        return (self.features if self.features is not None
                else self._DEFAULT_FEATURES)

    def fit(self, X, y=None):
        # Validate input is DataFrame
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                f"DatetimeFeatureExtractor requires pandas DataFrame, "
                f"got {type(X).__name__}"
            )

        # Validate columns exist
        missing = set(self.datetime_columns) - set(X.columns)
        if missing:
            raise ValueError(f"Columns not found in DataFrame: {missing}")

        # Store column order for consistent output
        self.feature_names_in_ = np.array(X.columns.tolist())
        self.n_features_in_ = len(self.feature_names_in_)
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                f"DatetimeFeatureExtractor requires pandas DataFrame, "
                f"got {type(X).__name__}"
            )

        X = X.copy()
        features = self._resolve_features()

        for col in self.datetime_columns:
            # Convert to datetime if needed
            dt = pd.to_datetime(X[col])

            # Extract requested features
            for feature in features:
                new_col_name = f"{col}_{feature}"
                if feature == 'hour':
                    X[new_col_name] = dt.dt.hour
                elif feature == 'minute':
                    X[new_col_name] = dt.dt.minute
                elif feature == 'second':
                    X[new_col_name] = dt.dt.second
                elif feature == 'dayofweek':
                    X[new_col_name] = dt.dt.dayofweek
                elif feature == 'day':
                    X[new_col_name] = dt.dt.day
                elif feature == 'month':
                    X[new_col_name] = dt.dt.month
                elif feature == 'year':
                    X[new_col_name] = dt.dt.year
                elif feature == 'quarter':
                    X[new_col_name] = dt.dt.quarter
                elif feature == 'is_weekend':
                    X[new_col_name] = (dt.dt.dayofweek >= 5).astype(int)

            if self.drop_original:
                X = X.drop(columns=[col])

        return X

    def get_feature_names_out(self, input_features=None):
        """Return output feature names, matching transform's column order."""
        features = self._resolve_features()
        # transform() appends new columns at the end, so surviving
        # original columns come first, in their original order
        output_cols = [
            col for col in self.feature_names_in_
            if col not in self.datetime_columns or not self.drop_original
        ]
        for col in self.datetime_columns:
            for feature in features:
                output_cols.append(f"{col}_{feature}")
        return np.array(output_cols)


# DataFrameWrapper: wrap any sklearn transformer to preserve DataFrame output
class DataFrameWrapper(BaseEstimator, TransformerMixin):
    """
    Wrap an sklearn transformer to preserve DataFrame output.

    Useful when you want numpy-based transformers to return
    DataFrames with proper column names.
    """

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        if hasattr(self.transformer, 'feature_names_in_'):
            self.feature_names_in_ = self.transformer.feature_names_in_
        return self

    def transform(self, X):
        X_transformed = self.transformer.transform(X)

        # Get feature names from transformer if available
        if hasattr(self.transformer, 'get_feature_names_out'):
            columns = self.transformer.get_feature_names_out()
        elif hasattr(X, 'columns'):
            columns = X.columns
        else:
            columns = [f'feature_{i}' for i in range(X_transformed.shape[1])]

        return pd.DataFrame(X_transformed, columns=columns,
                            index=getattr(X, 'index', None))


# Usage
df = pd.DataFrame({
    'signup_date': ['2023-01-15 10:30:00', '2023-06-20 14:45:00'],
    'amount': [100.0, 250.0]
})

extractor = DatetimeFeatureExtractor(
    datetime_columns=['signup_date'],
    features=['hour', 'dayofweek', 'month', 'is_weekend']
)

df_transformed = extractor.fit_transform(df)
print(df_transformed.columns.tolist())
# ['amount', 'signup_date_hour', 'signup_date_dayofweek',
#  'signup_date_month', 'signup_date_is_weekend']
```

Scikit-learn 1.2+ includes a set_output API that makes transformers return DataFrames automatically. Use `transformer.set_output(transform='pandas')` to enable it. This reduces the need for custom DataFrame-preserving transformers in newer sklearn versions.
Some transformers benefit from an inverse_transform method that reverses the transformation. This is essential for interpretability—converting model outputs back to the original feature scale:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """
    Apply Box-Cox transformation with learned optimal lambda.

    Supports inverse_transform for converting back to original scale.

    Parameters
    ----------
    lmbda : float or 'auto', default='auto'
        Box-Cox parameter. If 'auto', learns optimal lambda from data.
    shift : float, default=0
        Value added to ensure all values are positive: y = x + shift

    Attributes
    ----------
    lmbda_ : float
        The actual lambda used (learned if lmbda='auto')
    shift_ : float
        The actual shift used (may be computed if data has
        non-positive values)
    """

    def __init__(self, lmbda='auto', shift=0):
        self.lmbda = lmbda
        self.shift = shift

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)

        # Ensure data is positive
        min_val = np.min(X)
        if min_val <= 0:
            self.shift_ = -min_val + 1e-6
        else:
            self.shift_ = self.shift

        X_shifted = X + self.shift_

        # Learn optimal lambda or use provided
        if self.lmbda == 'auto':
            from scipy import stats
            # Learn lambda that maximizes log-likelihood
            _, self.lmbda_ = stats.boxcox(X_shifted.flatten())
        else:
            self.lmbda_ = float(self.lmbda)

        return self

    def transform(self, X):
        check_is_fitted(self, ['lmbda_', 'shift_'])
        X = self._validate_data(X, reset=False)
        X_shifted = X + self.shift_

        # Box-Cox transformation
        if np.abs(self.lmbda_) < 1e-10:
            # Lambda ≈ 0: use log transformation
            return np.log(X_shifted)
        else:
            return (np.power(X_shifted, self.lmbda_) - 1) / self.lmbda_

    def inverse_transform(self, X_transformed):
        """
        Reverse the Box-Cox transformation.

        Parameters
        ----------
        X_transformed : array-like of shape (n_samples, n_features)
            Data in transformed space.

        Returns
        -------
        X : ndarray of shape (n_samples, n_features)
            Data in original space.
        """
        check_is_fitted(self, ['lmbda_', 'shift_'])
        X_transformed = np.asarray(X_transformed)

        # Inverse Box-Cox
        if np.abs(self.lmbda_) < 1e-10:
            X_shifted = np.exp(X_transformed)
        else:
            X_shifted = np.power(X_transformed * self.lmbda_ + 1,
                                 1 / self.lmbda_)

        # Remove shift
        return X_shifted - self.shift_


# Usage: Transform and inverse transform
np.random.seed(42)
X = np.random.exponential(scale=2.0, size=(100, 2))  # Right-skewed data

transformer = BoxCoxTransformer(lmbda='auto')
X_transformed = transformer.fit_transform(X)
X_reconstructed = transformer.inverse_transform(X_transformed)

print(f"Learned lambda: {transformer.lmbda_:.4f}")
print(f"Original mean: {X.mean():.4f}")
print(f"Transformed mean: {X_transformed.mean():.4f}")
print(f"Reconstruction error: {np.abs(X - X_reconstructed).max():.2e}")

# Practical use: Inverse transform predictions
from sklearn.linear_model import LinearRegression

# Transform target variable
y = np.random.exponential(scale=100, size=100) + 50
y_transformer = BoxCoxTransformer(lmbda='auto')
y_transformed = y_transformer.fit_transform(y.reshape(-1, 1)).ravel()

# Train on transformed target
model = LinearRegression()
model.fit(X, y_transformed)

# Predict in transformed space
y_pred_transformed = model.predict(X[:5])

# Convert predictions back to original scale
y_pred_original = y_transformer.inverse_transform(
    y_pred_transformed.reshape(-1, 1))
print(f"Predictions in original scale: {y_pred_original.ravel()}")
```

Implement `inverse_transform` when: (1) you're transforming targets and need predictions in original units, (2) you need to explain feature effects in the original scale, (3) the transformation is mathematically invertible, (4) you're building pipelines that support full round-trip transformation.
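The manual transform-train-inverse pattern for targets can also be delegated to scikit-learn's built-in `TransformedTargetRegressor`, which fits the regressor on `func(y)` and applies `inverse_func` to predictions automatically. A minimal sketch (synthetic data, names illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=100))  # positive, right-skewed target

# Fits LinearRegression on log(y); predict() applies exp() so the
# outputs are already back in the original target scale.
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
ttr.fit(X, y)
preds = ttr.predict(X[:3])
print(preds)  # predictions in original (positive) scale
```

This avoids hand-managing a separate `y_transformer` object and keeps the round-trip inside a single estimator that works with `cross_val_score` and grid search.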
Scikit-learn's cross-validation and grid search rely on clone() to create fresh copies of estimators for each fold. Custom transformers must be "clonable" to work correctly. This requires strict adherence to the estimator contract:
```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
import numpy as np


# BROKEN: Won't clone correctly
class BrokenTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.5, **kwargs):
        self.threshold = threshold
        self.extra_stuff = kwargs               # BAD: arbitrary kwargs break cloning
        self._precomputed = self.threshold * 2  # BAD: computation in __init__

    def fit(self, X, y=None):
        return self


# get_params() cannot introspect **kwargs, so cloning silently
# drops them: the clone is missing random_kwarg entirely
broken = BrokenTransformer(threshold=0.7, random_kwarg=42)
cloned = clone(broken)
print(broken.extra_stuff)  # {'random_kwarg': 42}
print(cloned.extra_stuff)  # {} — silently lost in the clone


# CORRECT: Follows cloning requirements
class CorrectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.5):
        # Rule 1: Parameters stored with EXACT same name as argument
        self.threshold = threshold
        # Rule 2: NO computation, NO derived values

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # Computed values go here, with trailing underscore
        self.threshold_value_ = self.threshold * np.std(X)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        return np.where(X > self.threshold_value_, X, 0)


# Cloning works
correct = CorrectTransformer(threshold=0.7)
correct.fit(np.array([[1, 2], [3, 4]]))
print(f"Original threshold_value_: {correct.threshold_value_}")

cloned = clone(correct)
print(f"Cloned has threshold: {cloned.threshold}")
print(f"Cloned is fitted: {hasattr(cloned, 'threshold_value_')}")  # False - unfitted!


# Verifying clone behavior
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('transformer', CorrectTransformer(threshold=0.5)),
    ('classifier', LogisticRegression())
])

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# cross_val_score clones the pipeline for each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV scores: {scores}")  # Works because transformer is clonable


# Testing clonability
def test_clonable(estimator):
    """Test that an estimator can be cloned correctly."""
    try:
        cloned = clone(estimator)

        # Check parameters match
        orig_params = estimator.get_params()
        clone_params = cloned.get_params()
        for key in orig_params:
            if orig_params[key] != clone_params[key]:
                print(f"Parameter mismatch: {key}")
                return False

        # Check clone is unfitted
        fitted_attrs = [a for a in dir(cloned)
                        if a.endswith('_') and not a.startswith('_')]
        if fitted_attrs:
            print(f"Clone appears fitted: {fitted_attrs}")
            return False

        print("Estimator is clonable ✓")
        return True
    except Exception as e:
        print(f"Clone failed: {e}")
        return False


test_clonable(CorrectTransformer(threshold=0.5))
```

The cloning requirements, summarized:

- `__init__` parameters stored as attributes with identical names — `__init__(self, x=1)` must store `self.x = x`
- No computation in `__init__` — derived values computed there break `clone`'s assumption that `__init__(**params)` creates an equivalent unfitted estimator
- No `*args` or `**kwargs` — these cannot be introspected by `get_params()` for cloning
- Learned state only in `fit()`, with trailing underscores — `self.mean_`, `self.coef_`, etc.; `clone` creates an unfitted copy without them

Never do `self.derived_param = some_function(self.param)` in `__init__`. `clone` recreates the estimator by calling `__init__(**get_params())` and expects an unfitted estimator. Derived values must be computed in `fit()` and stored with a trailing underscore.
Robust custom transformers require thorough testing. Scikit-learn provides utilities to verify estimator compliance, and you should add domain-specific tests:
```python
import numpy as np
import pytest
from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks
from sklearn.base import BaseEstimator, TransformerMixin, clone


# The transformer under test — a minimal working stand-in; substitute
# your own custom transformer here
class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param=1.0):
        self.param = param

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        return X * self.param


# Method 1: Run all sklearn checks (comprehensive but slow)
def test_sklearn_compliance():
    """Run sklearn's estimator checks."""
    transformer = MyTransformer()
    # This runs ~30 checks for transformers
    for estimator, check in check_estimator(transformer, generate_only=True):
        try:
            check(estimator)
        except Exception as e:
            print(f"Check {check.func.__name__} failed: {e}")


# Method 2: Use pytest parametrize (recommended for CI)
@parametrize_with_checks([MyTransformer()])
def test_sklearn_compatible(estimator, check):
    check(estimator)


# Method 3: Manual essential tests (faster, targeted)
class TestMyTransformer:

    @pytest.fixture
    def sample_data(self):
        np.random.seed(42)
        return np.random.randn(100, 5)

    @pytest.fixture
    def transformer(self):
        return MyTransformer(param=1.0)

    def test_fit_returns_self(self, transformer, sample_data):
        """Verify fit() returns self for method chaining."""
        result = transformer.fit(sample_data)
        assert result is transformer

    def test_transform_shape(self, transformer, sample_data):
        """Verify transform preserves number of samples."""
        transformer.fit(sample_data)
        transformed = transformer.transform(sample_data)
        assert transformed.shape[0] == sample_data.shape[0]

    def test_fit_transform_equals_fit_then_transform(self, transformer,
                                                     sample_data):
        """Verify fit_transform consistency."""
        t1 = clone(transformer)
        t2 = clone(transformer)
        result1 = t1.fit_transform(sample_data)
        result2 = t2.fit(sample_data).transform(sample_data)
        np.testing.assert_array_almost_equal(result1, result2)

    def test_transform_without_fit_raises(self, transformer, sample_data):
        """Verify transform before fit raises error."""
        with pytest.raises(Exception):  # NotFittedError or similar
            transformer.transform(sample_data)

    def test_clone_produces_unfitted_copy(self, transformer, sample_data):
        """Verify clone creates unfitted estimator."""
        transformer.fit(sample_data)
        cloned = clone(transformer)
        # Clone should not be fitted
        assert not hasattr(cloned, 'n_features_in_') or \
            getattr(cloned, 'n_features_in_', None) is None

    def test_get_set_params(self, transformer):
        """Verify parameter introspection."""
        params = transformer.get_params()
        assert 'param' in params

        new_transformer = transformer.set_params(param=2.0)
        assert new_transformer.param == 2.0

    def test_handles_pandas_dataframe(self, transformer):
        """Verify DataFrame compatibility."""
        import pandas as pd
        df = pd.DataFrame(np.random.randn(50, 3), columns=['a', 'b', 'c'])
        transformer.fit(df)
        result = transformer.transform(df)  # Should not raise

    def test_handles_nan(self, transformer, sample_data):
        """Verify NaN handling (if applicable)."""
        data_with_nan = sample_data.copy()
        data_with_nan[0, 0] = np.nan

        # Either handles gracefully or raises an informative error
        try:
            transformer.fit(data_with_nan)
        except ValueError as e:
            assert 'nan' in str(e).lower() or 'missing' in str(e).lower()

    def test_pickling(self, transformer, sample_data):
        """Verify transformer can be serialized."""
        import pickle
        transformer.fit(sample_data)
        pickled = pickle.dumps(transformer)
        unpickled = pickle.loads(pickled)
        np.testing.assert_array_almost_equal(
            transformer.transform(sample_data),
            unpickled.transform(sample_data)
        )
```

Start with `check_estimator()` to catch contract violations. Then add domain-specific tests for your transformer's unique behavior. Include edge cases: empty data, single sample, single feature, all-NaN column, constant columns, mixed dtypes. Test serialization if used in production.
Custom transformers unlock the ability to incorporate arbitrary domain logic into scikit-learn workflows. Let's consolidate the key insights:
- The transformer contract: implement fit() and transform(), return self from fit, and store parameters exactly as named in __init__.
- DataFrame workflows: use the set_output API or custom wrappers to preserve DataFrame semantics.
- Testing: check_estimator() plus custom edge case tests.

What's Next:
Once you've built robust transformation pipelines, you need to save them for production use. The next page covers Pipeline Serialization—how to persist fitted pipelines, manage versioning, handle compatibility, and ensure reliable deployment.
You now have the knowledge to build production-grade custom transformers that integrate with scikit-learn's ecosystem. Next, we'll learn how to serialize these transformers and pipelines for deployment and reproducibility.