If ML models are the brains of intelligent systems, data pipelines are the circulatory system—invisible but essential, moving the vital resources that keep everything alive. The most sophisticated model architecture is worthless without robust data infrastructure to feed it.
Data pipeline design for ML systems is fundamentally more complex than traditional ETL. ML pipelines must serve two masters simultaneously: training, which processes historical data to create models, and serving, which processes live data to power predictions. These two paths must produce identical feature values—any divergence creates training-serving skew, one of the most insidious bugs in production ML.
This page covers the engineering principles, architectural patterns, and practical considerations for building data pipelines that reliably transform raw data into ML-ready features at scale.
By the end of this page, you will understand how to design data pipelines for ML systems—from batch training pipelines to real-time feature serving, from feature stores to online-offline consistency guarantees. You'll learn to architect data infrastructure that enables reliable, scalable, and maintainable ML systems.
ML data pipelines are not a single system but an ecosystem of interconnected components, each serving specific purposes in the model lifecycle. Understanding this landscape is the first step toward effective design.
The Four Pipeline Types:
| Pipeline Type | Purpose | Latency | Scale | Typical Technology |
|---|---|---|---|---|
| Data Ingestion | Collect raw data from sources | Minutes to hours | Terabytes/day | Kafka, Kinesis, Airflow |
| Feature Engineering | Transform raw data into features | Minutes to real-time | High compute | Spark, Flink, dbt |
| Training Pipeline | Prepare data for model training | Hours (batch) | Hundreds of GB | TFX, Kubeflow, MLflow |
| Serving Pipeline | Compute features for inference | Milliseconds to seconds | High throughput | Feast, Redis, custom |
The Pipeline Integration Challenge
These pipelines don't exist in isolation—they form a complex dependency graph where upstream changes cascade downstream. A schema change in data ingestion can break feature engineering, which invalidates training data, which requires model retraining, which affects serving. Managing these dependencies is a core challenge of ML systems engineering.
Data Pipeline Lifecycle Stages:
In production ML systems, data pipeline code often constitutes 80% of the total codebase while the actual ML model code is only 20%. Most 'ML engineering' is really data engineering. Teams that underinvest in pipeline design pay the price in technical debt, debugging overhead, and unreliable models.
The fundamental architectural decision in ML data pipelines is choosing between batch and streaming processing—or more commonly, determining the right hybrid of both. Each paradigm has distinct characteristics, trade-offs, and appropriate use cases.
The Lambda Architecture
Many ML systems adopt the Lambda Architecture, which combines batch and streaming layers:
```
        ┌─────────────────────────────────────┐
        │        Unified Serving Layer        │
        └───────────┬─────────────┬───────────┘
                    │             │
       ┌────────────┴───────┐ ┌───┴────────────────┐
       │    Batch Layer     │ │    Speed Layer     │
       │ (Complete, Slower) │ │ (Partial, Faster)  │
       └────────────┬───────┘ └───┬────────────────┘
                    │             │
             Historical Data  Live Events
```
Lambda Architecture Trade-offs: the batch layer provides completeness and easy reprocessing at the cost of latency, while the speed layer provides freshness at the cost of approximation. Maintaining two processing paths also means implementing, testing, and debugging much of the feature logic twice.
The Kappa Architecture
The Kappa Architecture simplifies by treating everything as a stream:
```
┌──────────────────────────────────────┐
│        Unified Serving Layer         │
└─────────────────┬────────────────────┘
                  │
┌─────────────────┴────────────────────┐
│       Stream Processing Layer        │
│      (Single Processing Logic)       │
└─────────────────┬────────────────────┘
                  │
┌─────────────────┴────────────────────┐
│        Append-Only Event Log         │
│            (Full History)            │
└──────────────────────────────────────┘
```
When to Choose Each:
| Choose Lambda | Choose Kappa |
|---|---|
| Complex batch-only features | Simpler feature logic |
| Need to reprocess with different logic | Logic won't change often |
| Streaming tech can't handle full history | Have mature streaming infrastructure |
| Different feature logic for batch vs. real-time | Same logic applies everywhere |
Most ML systems should start with batch-only pipelines. Streaming adds significant complexity and cost. Only add streaming when you have a clear business requirement for real-time features—not just because it seems more sophisticated. Many successful ML systems operate entirely in batch mode.
Feature engineering transforms raw data into the representations that ML models consume. The infrastructure supporting this transformation must balance expressiveness, performance, and maintainability.
Feature Computation Patterns:
| Pattern | Description | Example | Infrastructure Need |
|---|---|---|---|
| Point-in-time lookup | Single value at inference time | User's current city | Key-value store with fast reads |
| Aggregate over history | Aggregation over time window | Purchases in last 30 days | Pre-computed aggregates or streaming |
| Join across entities | Combine data from multiple sources | User features + item features | Wide tables or efficient joins |
| Embedding lookup | Dense vector representation | Word embedding, user embedding | Vector store optimized for lookups |
| Real-time computation | Calculated at request time | Distance from user to item | Low-latency compute in serving path |
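The last two rows of the table can be combined in a few lines: an embedding lookup from a low-latency store, followed by a request-time computation in the serving path. A minimal sketch in plain Python, where the stores and embedding values are invented for illustration:

```python
import math

# Hypothetical embedding stores: entity id -> dense vector (illustrative values)
user_embeddings = {"u1": [0.1, 0.3, 0.5]}
product_embeddings = {"p1": [0.2, 0.1, 0.4], "p2": [-0.5, 0.2, -0.1]}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Real-time feature: computed at request time in the serving path."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def user_product_affinity(user_id: str, product_id: str) -> float:
    # Embedding lookup (a point-in-time read from a key-value store)...
    u = user_embeddings[user_id]
    p = product_embeddings[product_id]
    # ...followed by a request-time computation.
    return cosine_similarity(u, p)

print(round(user_product_affinity("u1", "p1"), 3))  # 0.922
```

The lookup is cheap because the expensive work (training the embeddings) happened offline; only the final similarity is computed per request.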
The Feature Definition Language Challenge
As feature engineering complexity grows, teams need ways to define features declaratively rather than imperatively. A feature definition language lets a feature be specified once, with its entity, source, and aggregation window, and then reused consistently across training and serving.
Example: Declarative Feature Definition
```yaml
# Example feature definitions using a declarative approach
entities:
  - name: user
    join_key: user_id
    description: "Platform user"
  - name: product
    join_key: product_id
    description: "Product in the catalog"

features:
  - name: user_purchase_count_30d
    entity: user
    description: "Number of purchases in the last 30 days"
    aggregation:
      function: count
      source: purchases
      window: 30d
    value_type: int64
    tags: [behavioral, engagement]

  - name: user_avg_order_value_90d
    entity: user
    description: "Average order value in the last 90 days"
    aggregation:
      function: mean
      source: orders.total_amount
      window: 90d
    value_type: float64
    tags: [monetary, engagement]

  - name: product_embedding
    entity: product
    description: "Dense embedding from product2vec model"
    computation:
      function: lookup_embedding
      model: product2vec_v3
      dimension: 128
    value_type: float64_vector
    tags: [embedding, product]

  - name: user_product_affinity
    entities: [user, product]
    description: "Cosine similarity between user and product embeddings"
    computation:
      function: cosine_similarity
      args:
        - user.user_embedding
        - product.product_embedding
    value_type: float64
    online_compute: true  # Computed at serving time
    tags: [interaction, prediction]
```

Feature stores (like Feast, Tecton, or Amazon SageMaker Feature Store) implement these concepts as managed infrastructure. They handle storage, serving, versioning, and consistency guarantees—allowing ML engineers to focus on feature logic rather than infrastructure. We'll cover feature stores in detail in the Feature Engineering chapter.
Training-serving skew is one of the most pernicious problems in production ML. It occurs when the features used during model training differ—subtly or dramatically—from the features computed at serving time. The model was trained on one data distribution but serves predictions on another.
This is particularly dangerous because the system keeps producing predictions; they are simply, and silently, worse than they should be.
Detection Strategies
Proactive monitoring for training-serving skew requires multi-layered detection:
| Method | What It Catches | Implementation |
|---|---|---|
| Unit Tests | Code path differences | Test feature functions with identical inputs, compare outputs |
| Statistical Monitoring | Distribution drift | Compare feature distributions between training and serving |
| Prediction Logging | Model input differences | Log serving inputs, compare to training data statistics |
| Dual Path Validation | Implementation bugs | Run training and serving code on same inputs, compare |
| Shadow Mode Testing | End-to-end skew | Serve model predictions without acting on them, measure offline |
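The "Dual Path Validation" row can be made concrete with a small harness. This is an illustrative sketch: `training_features` and `serving_features` stand in for the two real implementations, and the tolerance is an assumed choice.

```python
# Sketch of dual-path validation: run both implementations on identical
# inputs and flag any feature whose values diverge beyond a tolerance.

def training_features(user: dict) -> dict:
    # Stand-in for the batch/training feature code
    return {"purchase_count": len(user["purchases"]),
            "avg_amount": sum(user["purchases"]) / len(user["purchases"])}

def serving_features(user: dict) -> dict:
    # Stand-in for the online/serving feature code (deliberate bug: rounding)
    return {"purchase_count": len(user["purchases"]),
            "avg_amount": round(sum(user["purchases"]) / len(user["purchases"]), 0)}

def find_skew(user: dict, tol: float = 1e-6) -> list[str]:
    train, serve = training_features(user), serving_features(user)
    return [name for name in train if abs(train[name] - serve[name]) > tol]

user = {"purchases": [10.0, 25.0, 14.5]}
print(find_skew(user))  # the rounded avg_amount diverges
```

Run on a sample of production traffic, a harness like this catches implementation drift between the two code paths before the model's metrics ever show it.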
Prevention Strategies
1. Unified Feature Logic

The most robust prevention is using identical code for training and serving:
```python
# Single feature definition used in both contexts
# (UserData and FeatureVector are the application's own types)
from datetime import date, timedelta
from statistics import mean

def compute_user_features(user_data: UserData) -> FeatureVector:
    today = date.today()
    return FeatureVector(
        purchase_count_30d=sum(
            1 for p in user_data.purchases if p.date > today - timedelta(days=30)
        ),
        avg_order_value_90d=mean(
            p.amount for p in user_data.purchases if p.date > today - timedelta(days=90)
        ),
        # ... more features
    )

# Training: compute_user_features(historical_user_data)
# Serving:  compute_user_features(current_user_data)
```
2. Feature Store Architecture

Use a feature store that ensures batch-computed features are served identically:
```
┌────────────────────────────────────────────────────────────────┐
│                         Feature Store                          │
│  ┌──────────────────────────┐     ┌──────────────────────────┐ │
│  │      Offline Store       │     │       Online Store       │ │
│  │  (Historical features)   │─────│  (Low-latency serving)   │ │
│  └────────────┬─────────────┘     └────────────┬─────────────┘ │
└───────────────┼────────────────────────────────┼───────────────┘
                │                                │
          Training Data                  Serving Requests
```
3. Point-in-Time Joins

For historical training data, ensure features are computed as they would have been at prediction time:
```sql
-- Correct: point-in-time join
SELECT
    prediction_request.user_id,
    prediction_request.timestamp,
    features.feature_value
FROM prediction_request
LEFT JOIN features
    ON prediction_request.user_id = features.user_id
    AND features.computed_at <= prediction_request.timestamp
    AND features.computed_at > prediction_request.timestamp - INTERVAL '1 day'
```

```sql
-- Incorrect: uses latest features (data leakage)
SELECT
    prediction_request.user_id,
    prediction_request.timestamp,
    features.feature_value  -- This includes future information!
FROM prediction_request
LEFT JOIN features
    ON prediction_request.user_id = features.user_id
```
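The same point-in-time rule can be expressed outside SQL. A small sketch with invented data: for each request, take the most recent feature value computed at or before the request timestamp, and no more than one day earlier.

```python
from datetime import datetime, timedelta

# Feature rows: (user_id, computed_at, value) -- illustrative data
feature_rows = [
    ("u1", datetime(2024, 5, 1), 10),
    ("u1", datetime(2024, 5, 2), 12),
    ("u1", datetime(2024, 5, 3), 15),
]

def point_in_time_value(user_id, ts, max_age=timedelta(days=1)):
    """Latest feature computed at or before ts, and no older than max_age."""
    candidates = [
        (computed_at, value)
        for uid, computed_at, value in feature_rows
        if uid == user_id and computed_at <= ts and computed_at > ts - max_age
    ]
    return max(candidates)[1] if candidates else None

# A request on May 2 at noon must see the May 2 value, never the future May 3 one
print(point_in_time_value("u1", datetime(2024, 5, 2, 12, 0)))  # 12
```

Using `max` over `(computed_at, value)` pairs mirrors the "latest row within the window" semantics of the correct SQL query above.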
Training-serving skew often causes models to degrade slowly over time, making it hard to attribute to a root cause. By the time you notice poor model performance, the skew may have accumulated through multiple sources. Invest in monitoring and prevention upfront—it's far cheaper than debugging in production.
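One common way to implement distribution monitoring is the Population Stability Index (PSI) between a training sample and a serving sample. A minimal sketch; the bin edges, the samples, and the 0.2 alert threshold are conventional but assumed choices:

```python
import math

def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population Stability Index between a training (expected) and a
    serving (actual) sample, computed over fixed bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-4) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

edges = [0, 50, 100, 150, 200]
training = [20, 30, 60, 70, 110, 120, 160, 170]  # illustrative samples
serving = [20, 25, 30, 35, 40, 45, 60, 110]      # shifted toward low values
print(psi(training, serving, edges) > 0.2)  # True: flag for investigation
```

A PSI near zero means the two distributions match; values above roughly 0.2 are commonly treated as significant drift worth investigating.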
Data quality issues propagate through the entire ML system. Bad data leads to bad features, which leads to bad models, which leads to bad predictions. Unlike traditional software where bad input might cause obvious errors, ML systems often silently produce degraded results from bad data.
The Data Validation Pyramid:
Effective data validation operates at multiple levels, each catching different types of issues:
```python
# Example: Great Expectations-style data validation
from datetime import datetime

from great_expectations import expect

# Schema validation
expect(df.columns).to_contain(["user_id", "timestamp", "amount", "category"])
expect(df["user_id"]).to_be_of_type("string")
expect(df["amount"]).to_be_of_type("float64")

# Semantic validation
expect(df["amount"]).to_be_between(0, 100000)
expect(df["category"]).to_be_in_set(["electronics", "clothing", "food", "other"])
expect(df["timestamp"]).to_be_parseable_as_datetime()
expect(df["timestamp"]).to_be_less_than(datetime.utcnow())

# Statistical validation
expect(df["amount"]).mean_to_be_between(50, 200)
expect(df["amount"]).standard_deviation_to_be_less_than(500)
expect(df["category"]).unique_value_count_to_be_between(3, 10)

# Null validation
expect(df["user_id"]).to_have_no_nulls()
expect(df["amount"]).null_ratio_to_be_less_than(0.01)

# Referential integrity
expect(df["user_id"]).values_to_be_in(valid_user_ids)
```

Expectation-Based Data Contracts
Data contracts formalize expectations between data producers and consumers. For ML pipelines, this means agreeing up front on schemas, value ranges, freshness, and null-rate guarantees, so that a producer-side change surfaces as a contract violation rather than as silent model degradation.
Handling Data Quality Issues:
| Issue Type | Detection | Response Options |
|---|---|---|
| Missing data | Null counts, completeness | Impute, exclude, fail pipeline |
| Out of range | Value boundaries | Cap, exclude, flag for review |
| Schema change | Schema comparison | Adapt, reject, alert |
| Distribution shift | Statistical tests | Alert, retrain model, investigate |
| Duplicate records | Deduplication checks | Deduplicate, flag source issue |
| Timestamp issues | Ordering, gaps | Fill gaps, reject, alert |
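A few of the response options above, sketched as a single cleaning pass. The thresholds, the default imputation value, and the record shape are illustrative assumptions:

```python
# Sketch: impute missing values, cap out-of-range values, drop duplicates.
def clean_records(records, amount_cap=100_000.0, default_amount=0.0):
    seen, cleaned = set(), []
    for rec in records:
        key = (rec["user_id"], rec["timestamp"])
        if key in seen:            # duplicate record -> drop, flag source issue
            continue
        seen.add(key)
        rec = dict(rec)
        if rec["amount"] is None:  # missing data -> impute
            rec["amount"] = default_amount
        rec["amount"] = min(max(rec["amount"], 0.0), amount_cap)  # cap range
        cleaned.append(rec)
    return cleaned

raw = [
    {"user_id": "u1", "timestamp": 1, "amount": 120.0},
    {"user_id": "u1", "timestamp": 1, "amount": 120.0},     # duplicate
    {"user_id": "u2", "timestamp": 2, "amount": None},      # missing
    {"user_id": "u3", "timestamp": 3, "amount": 250_000.0}, # out of range
]
print(len(clean_records(raw)))  # 3 records survive
```

Whether to repair, exclude, or fail outright is a per-field policy decision; the code only shows the mechanics.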
The Fail-Fast vs. Fail-Safe Trade-off:
The right choice depends on the cost of bad predictions versus the cost of no predictions. For a recommendation system, serving slightly degraded recommendations is better than serving nothing. For a fraud detection system, it might be better to fail and rely on fallback rules.
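The trade-off can be encoded directly at the serving boundary. An illustrative sketch in which `validate`, the model call, and the fallback score are all stand-ins:

```python
# Fail-fast vs. fail-safe at the serving boundary.
FALLBACK_SCORE = 0.5  # e.g., a neutral score from rules-based fallback logic

def validate(features: dict) -> bool:
    return features.get("amount") is not None and features["amount"] >= 0

def model_predict(features: dict) -> float:
    return min(features["amount"] / 1000.0, 1.0)  # stand-in for the real model

def predict_fail_fast(features: dict) -> float:
    if not validate(features):
        # Fraud-style: refusing to predict is safer than predicting badly
        raise ValueError("bad features; refusing to predict")
    return model_predict(features)

def predict_fail_safe(features: dict) -> float:
    if not validate(features):
        return FALLBACK_SCORE  # recommendation-style: degrade, don't stop
    return model_predict(features)

print(predict_fail_safe({"amount": None}))  # 0.5
```

The two wrappers share all validation and model code; only the failure policy differs, which keeps the choice easy to revisit per use case.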
Define explicit SLAs for data quality metrics—not just availability. For example: '99% of records must pass all validations, with null rate < 1% for critical fields.' Track these SLAs alongside traditional data pipeline SLAs (freshness, completeness, latency).
ML data pipelines consist of many interdependent tasks—data extraction, validation, transformation, feature computation, model training, and deployment. Orchestrating these tasks reliably, efficiently, and transparently is the job of pipeline orchestration systems.
Core Orchestration Concerns: scheduling, dependency management, retries and failure handling, resource allocation, and observability into what ran, when, and why.
DAG-Based Orchestration
Most orchestrators model pipelines as Directed Acyclic Graphs (DAGs), where nodes are tasks and edges are dependencies:
```
                ┌──────────────┐
                │ Extract Data │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌───────────────┐ ┌────────────┐ ┌───────────┐
│ Validate Data │ │ Clean Data │ │ Log Stats │
└───────┬───────┘ └─────┬──────┘ └───────────┘
        │               │
        └───────┬───────┘
                ▼
       ┌──────────────────┐
       │ Compute Features │
       └────────┬─────────┘
                │
     ┌──────────┼──────────┐
     ▼          ▼          ▼
┌─────────┐ ┌──────────┐ ┌─────────┐
│  Train  │ │ Validate │ │ Update  │
│  Model  │ │ Features │ │  Store  │
└────┬────┘ └──────────┘ └─────────┘
     │
     ▼
┌─────────┐
│ Deploy  │
└─────────┘
```
Popular Orchestration Tools:
| Tool | Strengths | Best For |
|---|---|---|
| Apache Airflow | Mature, extensive integrations, Python-native | Complex ETL, established data teams |
| Prefect | Modern, cloud-native, dynamic DAGs | Growing teams, hybrid cloud |
| Dagster | Type-safe, testable, asset-centric | ML-heavy workflows, data quality focus |
| Kubeflow Pipelines | Kubernetes-native, ML-specific | ML training pipelines on Kubernetes |
| AWS Step Functions | Serverless, AWS integration | AWS-centric, simple workflows |
| dbt | SQL-centric, declarative transformations | Analytics, feature engineering in SQL |
```python
# Example: Airflow DAG for ML training pipeline
# (validate_training_data, compute_training_features, train_model_fn,
#  evaluate_model_fn, and deploy_to_production are defined elsewhere)
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    'ml_training_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['ml', 'training'],
) as dag:

    # Wait for upstream data pipeline
    wait_for_data = ExternalTaskSensor(
        task_id='wait_for_data_pipeline',
        external_dag_id='data_ingestion',
        external_task_id='final_validation',
        mode='poke',
        timeout=3600,
    )

    # Validate training data
    validate_data = PythonOperator(
        task_id='validate_training_data',
        python_callable=validate_training_data,
    )

    # Compute features
    compute_features = PythonOperator(
        task_id='compute_features',
        python_callable=compute_training_features,
        pool='spark-pool',  # Resource pool for Spark jobs
    )

    # Train model
    train_model = PythonOperator(
        task_id='train_model',
        python_callable=train_model_fn,
        pool='gpu-pool',  # GPU resource pool
        execution_timeout=timedelta(hours=4),
    )

    # Evaluate model
    evaluate_model = PythonOperator(
        task_id='evaluate_model',
        python_callable=evaluate_model_fn,
    )

    # Deploy if evaluation passes
    deploy_model = PythonOperator(
        task_id='deploy_model',
        python_callable=deploy_to_production,
        trigger_rule='all_success',
    )

    # Define dependencies
    wait_for_data >> validate_data >> compute_features >> train_model >> evaluate_model >> deploy_model
```

Orchestration tools manage task execution and dependencies. Workflow tools often add higher-level abstractions like data lineage, versioning, and experiment tracking. For ML, you often need both: an orchestrator for reliable execution and an ML platform (MLflow, Weights & Biases, etc.) for experiment management.
Mature ML systems employ proven architectural patterns that balance scalability, reliability, and maintainability. Understanding these patterns enables informed design decisions.
Pattern 1: The Feature Platform
Centralize feature management to enable reuse and consistency:
```
┌─────────────────────────────────────────────────────────────┐
│                      Feature Platform                       │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐│
│  │ Feature Registry│ │ Feature Compute │ │ Feature Serving ││
│  │ - Definitions   │ │ - Batch ETL     │ │ - Online Store  ││
│  │ - Lineage       │ │ - Streaming     │ │ - Point Lookup  ││
│  │ - Versioning    │ │ - Validation    │ │ - Batch Export  ││
│  └─────────────────┘ └─────────────────┘ └─────────────────┘│
└───────────┬───────────────────┬───────────────────┬─────────┘
            │                   │                   │
       ┌────┴────┐         ┌────┴────┐         ┌────┴────┐
       │ Team A  │         │ Team B  │         │ Team C  │
       │ Model 1 │         │ Model 2 │         │ Model 3 │
       └─────────┘         └─────────┘         └─────────┘
```
Benefits: features are defined once and reused across teams and models, with shared lineage, versioning, and consistent definitions.
Pattern 2: Event-Driven Feature Updates
Update features in response to business events rather than on fixed schedules:
```
  Events (purchases, clicks, etc.)
                 │
                 ▼
  ┌──────────────────────────────┐
  │         Event Stream         │
  │    (Kafka, Kinesis, etc.)    │
  └──────────────┬───────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌────────────────┐  ┌───────────────┐
│     Stream     │  │   Analytics   │
│   Processor    │  │   Consumer    │
│  (Flink, etc.) │  │               │
└───────┬────────┘  └───────────────┘
        │
        ▼
┌────────────────┐
│ Feature Store  │
│  (immediate)   │
└────────────────┘
```
Benefits: features update as soon as the underlying events occur, giving lower staleness than any fixed batch schedule.
Pattern 3: Offline-Online Sync Pattern
Maintain consistency between offline (training) and online (serving) data stores:
```
                    ┌────────────────────┐
                    │ Batch Feature Job  │
                    └─────────┬──────────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
             ▼                │                ▼
    ┌─────────────────┐       │       ┌─────────────────┐
    │  Offline Store  │◄──────┴──────►│  Online Store   │
    │   (Data Lake)   │  Sync Process │     (Redis)     │
    └────────┬────────┘               └────────┬────────┘
             │                                 │
             ▼                                 ▼
       Training Data                   Serving Requests
```
The sync process ensures that the values served online match the values the offline store would produce for training, so every model sees one consistent feature view.
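A minimal sketch of such a sync job, with invented rows and a plain dict standing in for the online store (a real system would write to Redis or a similar key-value store):

```python
# Sketch: push the latest batch-computed feature values into the online store.
offline_rows = [  # output of the batch feature job (illustrative data)
    {"user_id": "u1", "computed_at": 1, "purchase_count_30d": 4},
    {"user_id": "u1", "computed_at": 2, "purchase_count_30d": 5},
    {"user_id": "u2", "computed_at": 2, "purchase_count_30d": 1},
]

online_store: dict[str, dict] = {}  # stand-in for Redis

def sync_offline_to_online(rows, store):
    """Keep only the most recent row per entity and overwrite the online copy,
    so serving reads match what the offline store holds for training."""
    latest: dict[str, dict] = {}
    for row in rows:
        uid = row["user_id"]
        if uid not in latest or row["computed_at"] > latest[uid]["computed_at"]:
            latest[uid] = row
    for uid, row in latest.items():
        store[uid] = {k: v for k, v in row.items() if k != "user_id"}

sync_offline_to_online(offline_rows, online_store)
print(online_store["u1"]["purchase_count_30d"])  # 5 (the latest value)
```

Keeping the `computed_at` timestamp alongside the value also lets serving-side monitors detect stale features when a sync run fails.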
These patterns aren't mutually exclusive. Production systems often combine them: a Feature Platform as the foundation, Event-Driven Updates for real-time features, and Offline-Online Sync for consistency. Start with the simplest pattern that meets requirements and add complexity as proven necessary.
Data pipelines are the foundation upon which ML systems are built. Without robust, scalable, and consistent data infrastructure, even the most sophisticated models fail in production. The investment in pipeline design pays dividends throughout the system lifecycle.
What's next:
With data pipelines feeding reliable features, the next step is designing the model architecture itself. The next page explores Model Architecture—how to design, select, and structure the models that transform features into predictions, including considerations for complexity, interpretability, and production constraints.
You now understand how to design data pipelines for ML systems—from ingestion through serving, with strategies for batch and streaming processing, training-serving consistency, data quality, and orchestration. Next, we design the models that consume this data infrastructure.