You've built a sophisticated feature transformation pipeline. It handles missing values, scales numeric features, encodes categoricals, extracts datetime components, and applies domain-specific transformations. Training took hours across gigabytes of data. Now you need to use this pipeline in production.
But there's a problem: your training script is a Python interpreter session, and that session will end. When a new request arrives in your prediction service tomorrow, how does it access that fitted pipeline?
The answer is serialization—the process of converting in-memory Python objects into a storable format that can be reconstructed later. For machine learning, serialization is the bridge between model training and model serving. Getting it right is essential for reproducible predictions, reliable serving, and safe model updates.
By the end of this page, you will understand the serialization options available for scikit-learn pipelines, their tradeoffs, and production best practices. You'll learn to avoid common pitfalls around version compatibility, handle custom transformers, and implement robust serialization workflows.
Python provides two primary approaches for object serialization: the built-in pickle module and the joblib library. Both serialize Python objects to bytes and back, but they have important differences for ML workloads.
```python
import pickle
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create and fit a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

X = np.random.randn(1000, 10)
y = (X[:, 0] > 0).astype(int)
pipeline.fit(X, y)

# ===== PICKLE =====
# Basic Python serialization

# Save with pickle
with open('model_pickle.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Load with pickle
with open('model_pickle.pkl', 'rb') as f:
    loaded_pickle = pickle.load(f)

# Verify
print(f"Pickle predictions match: "
      f"{np.allclose(pipeline.predict(X), loaded_pickle.predict(X))}")

# ===== JOBLIB =====
# Optimized for large numpy arrays

# Save with joblib (simple API)
joblib.dump(pipeline, 'model_joblib.pkl')

# Load with joblib
loaded_joblib = joblib.load('model_joblib.pkl')

# Verify
print(f"Joblib predictions match: "
      f"{np.allclose(pipeline.predict(X), loaded_joblib.predict(X))}")

# With compression (saves disk space, slightly slower)
joblib.dump(pipeline, 'model_joblib.pkl.gz', compress=3)
joblib.dump(pipeline, 'model_joblib.pkl.lz4', compress=('lz4', 3))  # Fast compression

# Memory-mapped loading (for huge arrays)
# Avoids loading entire file into memory
loaded_mmap = joblib.load('model_joblib.pkl', mmap_mode='r')
```

| Aspect | pickle | joblib |
|---|---|---|
| Built-in | Yes, standard library | No, requires pip install joblib |
| Large numpy arrays | Slow, high memory | Optimized, uses numpy memmap |
| Compression | No built-in support | Built-in (gzip, lz4, etc.) |
| Memory mapping | No | Yes, mmap_mode parameter |
| sklearn recommendation | Not recommended | Officially recommended |
| File size (typical) | Larger for array-heavy models | Smaller, more efficient |
Scikit-learn's official documentation recommends joblib over pickle. The performance difference is dramatic for models with large internal arrays (decision trees, neural networks, SVMs with many support vectors). Always use joblib.dump() and joblib.load() for sklearn pipelines.
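A quick way to sanity-check the tradeoffs on your own models is to dump the same object both ways and compare file sizes and round-trip fidelity. The sketch below uses a dict of numpy arrays as a stand-in for a fitted model; actual sizes and speedups depend heavily on how compressible the model's arrays are:

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Stand-in for a fitted model with large internal arrays
# (e.g., a forest's node arrays or an SVM's support vectors)
model_like = {"coef": np.random.randn(2000, 500), "classes": np.arange(500)}

tmp = tempfile.mkdtemp()
pkl_path = os.path.join(tmp, "model.pkl")
job_path = os.path.join(tmp, "model.joblib.gz")

with open(pkl_path, "wb") as f:
    pickle.dump(model_like, f)
joblib.dump(model_like, job_path, compress=3)

print(f"pickle:            {os.path.getsize(pkl_path) / 1e6:.1f} MB")
print(f"joblib (compress): {os.path.getsize(job_path) / 1e6:.1f} MB")

# Whatever the size difference, both must round-trip to identical arrays
with open(pkl_path, "rb") as f:
    from_pickle = pickle.load(f)
from_joblib = joblib.load(job_path)
assert np.array_equal(from_pickle["coef"], from_joblib["coef"])
```

Run this once against your real pipeline before committing to a format; random floats (as here) compress poorly, while tree ensembles often shrink dramatically.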
Serialization sounds simple—save bytes, load bytes. But production deployments encounter numerous failure modes that training environments never reveal:
```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import numpy as np

# ===== GOTCHA 1: Lambda functions can't be pickled =====

# This FAILS
bad_pipeline = Pipeline([
    ('transform', FunctionTransformer(lambda x: x ** 2))
])

bad_pipeline.fit(np.array([[1, 2], [3, 4]]))

try:
    joblib.dump(bad_pipeline, 'fails.pkl')
except Exception as e:
    print(f"Lambda serialization failed: {type(e).__name__}")

# SOLUTION: Use named functions
def square(x):
    return x ** 2

good_pipeline = Pipeline([
    ('transform', FunctionTransformer(square))
])

joblib.dump(good_pipeline, 'works.pkl')  # OK

# ===== GOTCHA 2: Version mismatch =====

# Serialized with sklearn 1.2.0, loaded with sklearn 1.3.0
# May work, may fail, may give wrong predictions

# BAD: No version tracking
joblib.dump(good_pipeline, 'model.pkl')

# BETTER: Include version metadata
import sklearn
import platform

model_package = {
    'pipeline': good_pipeline,
    'sklearn_version': sklearn.__version__,
    'python_version': platform.python_version(),
    'numpy_version': np.__version__,
    'trained_at': '2024-01-15T10:30:00Z'
}

joblib.dump(model_package, 'model_with_metadata.pkl')

# Load with version check
loaded = joblib.load('model_with_metadata.pkl')
if loaded['sklearn_version'] != sklearn.__version__:
    print(f"Warning: Trained with sklearn {loaded['sklearn_version']}, "
          f"loading with {sklearn.__version__}")

# ===== GOTCHA 3: Custom transformer module changes =====

# If your custom transformer is in my_transformers.py:
# from my_transformers import MyCustomTransformer

# And you rename/move that file, deserialization FAILS:
# ModuleNotFoundError: No module named 'my_transformers'

# SOLUTION: Keep transformer code in stable locations
# Or use cloudpickle for more flexible serialization

# ===== GOTCHA 4: Pandas DataFrame transformers with changed columns =====

# Pipeline fitted on DataFrame with columns ['a', 'b', 'c']
# Production data has different column order or names

# SOLUTION: Validate schema before prediction
def validate_input_schema(X, expected_columns):
    if list(X.columns) != expected_columns:
        missing = set(expected_columns) - set(X.columns)
        extra = set(X.columns) - set(expected_columns)
        raise ValueError(
            f"Schema mismatch. Missing: {missing}, Extra: {extra}"
        )

# ===== GOTCHA 5: Large model files =====

# Deep learning or ensemble models can be gigabytes
# Loading takes seconds, using memory during load

# SOLUTION 1: Use compression
joblib.dump(good_pipeline, 'model.pkl.gz', compress=('gzip', 3))

# SOLUTION 2: Use memory mapping for read-only access
loaded = joblib.load('model.pkl', mmap_mode='r')

# SOLUTION 3: Store models in object storage, load on demand
# s3://bucket/models/model_v1.pkl.gz
```

Pickle and joblib execute arbitrary code during deserialization. Never load pickled objects from untrusted sources. A malicious pickle file can execute arbitrary commands on your system. This is a critical security consideration for production systems that load models from external sources.
The security risks of pickle have led to the development of safer alternatives. SKOPS (scikit-learn operations) provides a secure serialization format specifically designed for sklearn models:
```python
# pip install skops

from skops import io as sio
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create and fit pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)
pipeline.fit(X, y)

# ===== Save with SKOPS =====
sio.dump(pipeline, 'model.skops')

# ===== Load with SKOPS (secure by default) =====
# SKOPS validates types before executing any code
# Unknown types are flagged and require explicit approval

# Get untrusted types
unknown_types = sio.get_untrusted_types(file='model.skops')
print(f"Unknown types: {unknown_types}")

# Review the list; if every flagged type is one you recognize,
# pass it explicitly to approve
loaded = sio.load('model.skops', trusted=unknown_types)

# Or trust specific types by name
loaded = sio.load('model.skops', trusted=['numpy.dtype'])

# ===== SKOPS with Hub integration =====
# Push models to Hugging Face Hub

from skops import hub_utils

# Create a model card (documentation)
hub_utils.init(
    model='model.skops',
    dst='./my-model-repo',
    task='tabular-classification',
    data=X,  # Optional: include sample data
)

# Add model card content
model_card = hub_utils.get_model_card('./my-model-repo')
model_card.add(
    plot_name='Confusion Matrix',
    section='Model description/Evaluation',
    path='./confusion_matrix.png'
)
model_card.save('./my-model-repo/README.md')

# Push to Hub (requires HF authentication)
# hub_utils.push('./my-model-repo', repo_id='username/my-sklearn-model')

# ===== Comparison with pickle security =====
# Pickle: Executes ANY code during load
# SKOPS: Only reconstructs known safe types

# Demonstration of pickle vulnerability (DO NOT RUN WITH UNTRUSTED FILES)
import pickle
import os

class MaliciousObject:
    def __reduce__(self):
        # This would execute when unpickled
        return (os.system, ('echo HACKED',))

# malicious_bytes = pickle.dumps(MaliciousObject())
# pickle.loads(malicious_bytes)  # Would print "HACKED"

# SKOPS would reject this as an untrusted type
```

Consider SKOPS when: (1) models are loaded from potentially untrusted sources, (2) you need model cards and Hub integration, (3) security is a priority. For internal pipelines where you control both training and serving, joblib remains the pragmatic choice due to mature tooling.
Scikit-learn models serialized with one version may not load correctly—or may produce different predictions—with another version. This is a critical production concern, especially for long-lived models.
```python
import joblib
import sklearn
import platform
import json
import hashlib
from datetime import datetime
from pathlib import Path
import numpy as np

class ModelArtifact:
    """
    A wrapper that bundles a model with all necessary metadata
    for safe deserialization and version compatibility.
    """

    def __init__(self, pipeline, training_config=None):
        self.pipeline = pipeline
        self.sklearn_version = sklearn.__version__
        self.python_version = platform.python_version()
        self.numpy_version = np.__version__
        self.created_at = datetime.utcnow().isoformat()
        self.training_config = training_config or {}
        # Compute hash for integrity verification
        self.model_hash = self._compute_hash()

    def _compute_hash(self):
        """Compute a hash of model predictions for verification."""
        # Use fixed random state for reproducible verification
        rng = np.random.RandomState(42)
        X_verify = rng.randn(10, self.pipeline.n_features_in_)
        predictions = self.pipeline.predict(X_verify)
        return hashlib.sha256(predictions.tobytes()).hexdigest()[:16]

    def verify_predictions(self):
        """Verify model produces expected predictions."""
        rng = np.random.RandomState(42)
        X_verify = rng.randn(10, self.pipeline.n_features_in_)
        predictions = self.pipeline.predict(X_verify)
        current_hash = hashlib.sha256(predictions.tobytes()).hexdigest()[:16]
        return current_hash == self.model_hash

    def check_compatibility(self):
        """Check if current environment matches training environment."""
        warnings = []

        # Major version differences are risky
        saved_major = self.sklearn_version.split('.')[0]
        current_major = sklearn.__version__.split('.')[0]
        if saved_major != current_major:
            warnings.append(
                f"sklearn major version mismatch: "
                f"saved {self.sklearn_version}, current {sklearn.__version__}"
            )
        # Minor version differences may be OK but worth noting
        elif self.sklearn_version != sklearn.__version__:
            warnings.append(
                f"sklearn version differs: "
                f"saved {self.sklearn_version}, current {sklearn.__version__}"
            )
        return warnings

    def save(self, path):
        """Save model artifact with metadata."""
        # Ensure the target directory exists
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(self, path)

        # Also save human-readable metadata
        metadata = {
            'sklearn_version': self.sklearn_version,
            'python_version': self.python_version,
            'numpy_version': self.numpy_version,
            'created_at': self.created_at,
            'model_hash': self.model_hash,
            'training_config': self.training_config
        }
        with open(f"{path}.meta.json", 'w') as f:
            json.dump(metadata, f, indent=2)

    @staticmethod
    def load(path, verify=True):
        """Load model artifact with compatibility checks."""
        artifact = joblib.load(path)

        # Check compatibility
        warnings = artifact.check_compatibility()
        for warning in warnings:
            print(f"Warning: {warning}")

        # Verify predictions still match
        if verify and not artifact.verify_predictions():
            raise RuntimeError(
                "Model prediction verification failed! "
                "The model may have been corrupted or version incompatibility "
                "is causing different predictions."
            )
        return artifact

# Usage
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)
pipeline.fit(X, y)

# Save with full metadata
artifact = ModelArtifact(
    pipeline,
    training_config={
        'training_samples': 100,
        'features': ['f1', 'f2', 'f3', 'f4', 'f5'],
        'target': 'binary_classification'
    }
)

artifact.save('models/model_v1.pkl')

# Load with verification
loaded = ModelArtifact.load('models/model_v1.pkl')
predictions = loaded.pipeline.predict(X[:5])
```

Scikit-learn maintains backward compatibility within minor versions (1.2.x → 1.2.y) but may break compatibility across minor versions (1.2.x → 1.3.x). Always test before deploying models across version boundaries, and prefer retraining when upgrading major versions.
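That policy can be made executable at load time. A small guard, sketched below — the three-tier classification is one reasonable reading of the compatibility rule above, not an official sklearn contract:

```python
def compatibility_level(saved: str, current: str) -> str:
    """Classify sklearn version drift between training and serving.

    'ok'    - same major.minor (patch releases preserve compatibility)
    'warn'  - same major, different minor (test before trusting)
    'block' - different major (prefer retraining over loading)
    """
    s_major, s_minor = (int(p) for p in saved.split(".")[:2])
    c_major, c_minor = (int(p) for p in current.split(".")[:2])
    if s_major != c_major:
        return "block"
    if s_minor != c_minor:
        return "warn"
    return "ok"

assert compatibility_level("1.3.0", "1.3.2") == "ok"
assert compatibility_level("1.2.2", "1.3.0") == "warn"
assert compatibility_level("0.24.2", "1.3.0") == "block"
```

Wiring this into `ModelArtifact.load` (refusing on `'block'`, logging on `'warn'`) turns a silent failure mode into an explicit deployment decision.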
Custom transformers introduce additional serialization challenges. The code defining the transformer class must be available at load time, and the class definition must match exactly.
```python
# ===== Challenge: Module must exist at load time =====

# training.py (training environment)
from my_transformers import MyCustomScaler  # Works during training
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('scaler', MyCustomScaler())])
pipeline.fit(X, y)
joblib.dump(pipeline, 'model.pkl')

# serving.py (production environment)
# If my_transformers.py isn't available:
# ImportError: No module named 'my_transformers'

# ===== Solution 1: Package your transformers =====

# Create a proper Python package:
# my_ml_package/
#     __init__.py
#     transformers.py
#     pipelines.py

# Install in both training and serving: pip install my_ml_package
# Import from package: from my_ml_package.transformers import MyCustomScaler

# ===== Solution 2: Use cloudpickle for self-contained serialization =====

# pip install cloudpickle
import cloudpickle

# cloudpickle serializes the class definition along with the instance
with open('model_cloudpickle.pkl', 'wb') as f:
    cloudpickle.dump(pipeline, f)

# Loading doesn't require the original module to exist
with open('model_cloudpickle.pkl', 'rb') as f:
    loaded = cloudpickle.load(f)

# ===== Solution 3: Define transformers inline (for simple cases) =====

# Instead of:
# from external_module import CustomTransformer

# Define in the same module that's always available:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param=1.0):
        self.param = param

    def fit(self, X, y=None):
        self.fitted_ = True
        return self

    def transform(self, X):
        return X * self.param

# This class definition must exist at both training and serving time

# ===== Solution 4: Store class source code with model =====

import inspect

def save_pipeline_with_source(pipeline, path):
    """Save pipeline along with custom transformer source code."""
    custom_sources = {}
    # Find all custom transformers and save their source
    for name, step in pipeline.named_steps.items():
        cls = type(step)
        if not cls.__module__.startswith('sklearn'):  # Custom class
            try:
                source = inspect.getsource(cls)
                custom_sources[cls.__name__] = source
            except (OSError, TypeError):
                pass
    package = {
        'pipeline': pipeline,
        'custom_sources': custom_sources
    }
    joblib.dump(package, path)

def load_pipeline_with_source(path):
    """Load pipeline, reconstructing custom transformers if needed."""
    package = joblib.load(path)
    # If loading fails due to missing modules, you could:
    # 1. Execute the saved source code
    # 2. Dynamically create the class
    # (This is advanced and has security implications)
    return package['pipeline']

# ===== Best Practice: Consistent module structure =====

# Good: transformers in well-organized package
# myproject/
#     models/
#         __init__.py
#         transformers.py   # Custom transformers here
#         pipelines.py      # Pipeline construction here
#     training/
#         train.py
#     serving/
#         predict.py

# Both training and serving import from same package
# from myproject.models.transformers import MyTransformer
```

cloudpickle serializes class definitions along with instances, making models more portable. However, it increases file size and can break if the class uses features that changed between Python versions. Use it judiciously for development; prefer proper packaging for production.
Production ML systems need more than ad-hoc file saving. A model registry provides centralized storage, versioning, and lifecycle management for models. Several patterns exist, from simple file-based approaches to full-featured MLOps platforms.
```python
import os
import json
import shutil
from datetime import datetime
import hashlib
import joblib
from pathlib import Path

class SimpleFileRegistry:
    """
    A simple file-based model registry for local development.

    Structure:
        registry_path/
            model_name/
                v1/
                    model.pkl
                    metadata.json
                v2/
                    model.pkl
                    metadata.json
                current -> v2/  (symlink to current version)
    """

    def __init__(self, registry_path='./model_registry'):
        self.registry_path = Path(registry_path)
        self.registry_path.mkdir(parents=True, exist_ok=True)

    def register(self, model, model_name, metadata=None):
        """Register a new model version."""
        model_dir = self.registry_path / model_name
        model_dir.mkdir(exist_ok=True)

        # Determine next version
        existing_versions = [
            int(d.name[1:]) for d in model_dir.iterdir()
            if d.is_dir() and d.name.startswith('v')
        ]
        next_version = max(existing_versions, default=0) + 1
        version_str = f'v{next_version}'

        # Create version directory
        version_dir = model_dir / version_str
        version_dir.mkdir()

        # Save model
        model_path = version_dir / 'model.pkl'
        joblib.dump(model, model_path)

        # Save metadata
        full_metadata = {
            'version': version_str,
            'registered_at': datetime.utcnow().isoformat(),
            'file_hash': self._compute_hash(model_path),
            **(metadata or {})
        }
        with open(version_dir / 'metadata.json', 'w') as f:
            json.dump(full_metadata, f, indent=2)

        print(f"Registered {model_name} {version_str}")
        return version_str

    def promote_to_current(self, model_name, version):
        """Set a specific version as 'current'."""
        model_dir = self.registry_path / model_name
        current_link = model_dir / 'current'

        # Remove existing symlink (or copied directory on Windows)
        if current_link.is_symlink():
            current_link.unlink()
        elif current_link.exists():
            shutil.rmtree(current_link)

        # Create new symlink (or copy on Windows)
        version_dir = model_dir / version
        if os.name == 'nt':  # Windows
            shutil.copytree(version_dir, current_link)
        else:
            current_link.symlink_to(version)

        print(f"Promoted {model_name} {version} to current")

    def load(self, model_name, version='current'):
        """Load a model by name and version."""
        model_path = self.registry_path / model_name / version / 'model.pkl'
        if not model_path.exists():
            raise FileNotFoundError(
                f"Model {model_name} {version} not found at {model_path}"
            )
        return joblib.load(model_path)

    def get_metadata(self, model_name, version='current'):
        """Get metadata for a model version."""
        meta_path = self.registry_path / model_name / version / 'metadata.json'
        with open(meta_path) as f:
            return json.load(f)

    def list_versions(self, model_name):
        """List all versions of a model."""
        model_dir = self.registry_path / model_name
        if not model_dir.exists():
            return []

        versions = []
        for d in sorted(model_dir.iterdir()):
            if d.is_dir() and d.name.startswith('v'):
                meta = self.get_metadata(model_name, d.name)
                versions.append({
                    'version': d.name,
                    'registered_at': meta.get('registered_at'),
                    'is_current': (model_dir / 'current').resolve() == d.resolve()
                })
        return versions

    def _compute_hash(self, path):
        with open(path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()[:16]

# Usage
registry = SimpleFileRegistry('./my_registry')

# Register models
registry.register(pipeline1, 'churn_predictor', {'accuracy': 0.85})
registry.register(pipeline2, 'churn_predictor', {'accuracy': 0.87})

# Promote best to current
registry.promote_to_current('churn_predictor', 'v2')

# Load in production
model = registry.load('churn_predictor')  # Loads current

# List versions
print(registry.list_versions('churn_predictor'))
```

Production-Grade Alternatives:
For production systems, consider established MLOps platforms:
| Tool | Type | Key Features |
|---|---|---|
| MLflow | Open source | Experiment tracking, model registry, deployment |
| DVC | Open source | Data versioning, pipeline tracking, storage backends |
| Weights & Biases | SaaS | Experiment tracking, model versioning, visualizations |
| SageMaker Registry | AWS | Integrated with SageMaker, model approval workflows |
| Vertex AI | GCP | Integrated with Vertex AI, explainability, monitoring |
A model registry is metadata about models (versions, metrics, lineage). An artifact store is where model files actually live (S3, GCS, local filesystem). Production registries typically separate these concerns—the registry tracks what exists, while object storage actually stores the bytes.
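That separation can be made concrete with two tiny classes — a hypothetical sketch in which a JSON index plays the registry and a content-addressed directory plays the object store (in production these would be a database and S3/GCS, and the names here are illustrative):

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

class ArtifactStore:
    """Content-addressed blob storage (stands in for S3/GCS)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, src_path) -> str:
        """Store a file under its content hash; return the key."""
        digest = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
        shutil.copy(src_path, self.root / digest)
        return digest

    def get(self, digest) -> Path:
        return self.root / digest

class Registry:
    """Metadata only: names, versions, metrics, and store keys."""

    def __init__(self, index_file):
        self.index_file = Path(index_file)
        self.entries = (
            json.loads(self.index_file.read_text())
            if self.index_file.exists() else []
        )

    def register(self, name, digest, metrics):
        version = sum(e["name"] == name for e in self.entries) + 1
        self.entries.append(
            {"name": name, "version": version,
             "artifact": digest, "metrics": metrics}
        )
        self.index_file.write_text(json.dumps(self.entries, indent=2))
        return version

# Demo: model bytes go to the store; only the key goes to the registry
tmp = Path(tempfile.mkdtemp())
model_file = tmp / "model.pkl"
model_file.write_bytes(b"fake-model-bytes")

store = ArtifactStore(tmp / "blobs")
registry = Registry(tmp / "index.json")

key = store.put(model_file)
version = registry.register("churn_predictor", key, {"accuracy": 0.87})
print(f"v{version} -> {key[:12]}...")
assert store.get(key).read_bytes() == b"fake-model-bytes"
```

Because the registry holds only a hash key, the same artifact can back multiple registered versions, and blobs can be garbage-collected or replicated independently of the metadata.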
MLflow is the most widely adopted open-source platform for ML lifecycle management. Its sklearn integration makes serialization, versioning, and deployment straightforward:
```python
# pip install mlflow

import mlflow
import mlflow.sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Setup
mlflow.set_tracking_uri("sqlite:///mlflow.db")  # Or remote server
mlflow.set_experiment("churn-prediction")

# ===== Log a training run with model =====

X = np.random.randn(1000, 10)
y = (X[:, 0] > 0).astype(int)

with mlflow.start_run(run_name="baseline_logreg") as run:
    # Define pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(C=1.0, max_iter=1000))
    ])

    # Train
    pipeline.fit(X, y)

    # Evaluate
    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    accuracy = cv_scores.mean()

    # Log parameters
    mlflow.log_param("C", 1.0)
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("scaler", "StandardScaler")

    # Log metrics
    mlflow.log_metric("cv_accuracy_mean", accuracy)
    mlflow.log_metric("cv_accuracy_std", cv_scores.std())

    # Log the model
    mlflow.sklearn.log_model(
        pipeline,
        artifact_path="model",
        registered_model_name="churn_predictor"  # Auto-register
    )

    run_id = run.info.run_id
    print(f"Run ID: {run_id}")

# ===== Load model from registry =====

# By run ID
model_uri = f"runs:/{run_id}/model"
loaded_model = mlflow.sklearn.load_model(model_uri)

# By model registry name and version
model_uri = "models:/churn_predictor/1"
loaded_model = mlflow.sklearn.load_model(model_uri)

# By model registry stage
model_uri = "models:/churn_predictor/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)

# ===== Transition model stages =====

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition to staging
client.transition_model_version_stage(
    name="churn_predictor",
    version=1,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="churn_predictor",
    version=1,
    stage="Production",
    archive_existing_versions=True  # Archive previous production version
)

# ===== Serve model via MLflow =====
# Command line:
#   mlflow models serve -m "models:/churn_predictor/Production" -p 5000

# Or load for batch inference
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/churn_predictor/Production")
predictions = model.predict(X_new)  # X_new: incoming feature batch
```

Always use registered_model_name to auto-register models. Use model stages (Staging → Production) for controlled deployment, and `mlflow models serve` for quick deployment. Store the tracking server URI in environment variables. Use the pyfunc interface for framework-agnostic loading.
Serialization bridges training and serving, enabling models to persist beyond the session that created them. The key insights: prefer joblib over pickle for sklearn pipelines; record the sklearn, numpy, and Python versions alongside every artifact; treat pickled files as executable code and never load them from untrusted sources; keep custom transformer code in stable, installable packages; and manage versions through a registry rather than ad-hoc files.
What's Next:
Serialized pipelines need to run in production environments very different from training notebooks. The final page covers Production Deployment—how to package, deploy, monitor, and operate feature transformation pipelines in real-world serving systems.
You now understand how to safely serialize sklearn pipelines with version tracking, security considerations, and registry integration. Next, we'll complete the journey from training to production with deployment strategies.