If ML models are the brains of intelligent systems, data pipelines are the circulatory system—invisible but essential, moving the vital resources that keep everything alive. The most sophisticated model architecture is worthless without robust data infrastructure to feed it.
Data pipeline design for ML systems is fundamentally more complex than traditional ETL. ML pipelines must serve two masters simultaneously: training, which processes historical data to create models, and serving, which processes live data to power predictions. These two paths must produce identical feature values—any divergence creates training-serving skew, one of the most insidious bugs in production ML.
This page covers the engineering principles, architectural patterns, and practical considerations for building data pipelines that reliably transform raw data into ML-ready features at scale.
By the end of this page, you will understand how to design data pipelines for ML systems—from batch training pipelines to real-time feature serving, from feature stores to online-offline consistency guarantees. You'll learn to architect data infrastructure that enables reliable, scalable, and maintainable ML systems.
ML data pipelines are not a single system but an ecosystem of interconnected components, each serving specific purposes in the model lifecycle. Understanding this landscape is the first step toward effective design.
The Four Pipeline Types:
| Pipeline Type | Purpose | Latency | Scale | Typical Technology |
|---|---|---|---|---|
| Data Ingestion | Collect raw data from sources | Minutes to hours | Terabytes/day | Kafka, Kinesis, Airflow |
| Feature Engineering | Transform raw data into features | Minutes to real-time | High compute | Spark, Flink, dbt |
| Training Pipeline | Prepare data for model training | Hours (batch) | Hundreds of GB | TFX, Kubeflow, MLflow |
| Serving Pipeline | Compute features for inference | Milliseconds to seconds | High throughput | Feast, Redis, custom |
The Pipeline Integration Challenge
These pipelines don't exist in isolation—they form a complex dependency graph where upstream changes cascade downstream. A schema change in data ingestion can break feature engineering, which invalidates training data, which requires model retraining, which affects serving. Managing these dependencies is a core challenge of ML systems engineering.
Data Pipeline Lifecycle Stages:
In production ML systems, data pipeline code often constitutes 80% of the total codebase while the actual ML model code is only 20%. Most 'ML engineering' is really data engineering. Teams that underinvest in pipeline design pay the price in technical debt, debugging overhead, and unreliable models.
The fundamental architectural decision in ML data pipelines is choosing between batch and streaming processing—or more commonly, determining the right hybrid of both. Each paradigm has distinct characteristics, trade-offs, and appropriate use cases.
The Lambda Architecture
Many ML systems adopt the Lambda Architecture, which combines batch and streaming layers:
```
        ┌─────────────────────────────────────┐
        │        Unified Serving Layer        │
        └───────────┬─────────────┬───────────┘
                    │             │
       ┌────────────┴───────┐ ┌───┴────────────────┐
       │    Batch Layer     │ │    Speed Layer     │
       │ (Complete, Slower) │ │ (Partial, Faster)  │
       └────────────┬───────┘ └───┬────────────────┘
                    │             │
             Historical Data  Live Events
```
Lambda Architecture Trade-offs: the batch layer provides completeness and easy reprocessing at the cost of latency, while the speed layer provides freshness at the cost of approximation. Maintaining two processing paths also means implementing, testing, and debugging much of the feature logic twice.
The Kappa Architecture
The Kappa Architecture simplifies by treating everything as a stream:
```
┌──────────────────────────────────────┐
│        Unified Serving Layer         │
└─────────────────┬────────────────────┘
                  │
┌─────────────────┴────────────────────┐
│       Stream Processing Layer        │
│      (Single Processing Logic)       │
└─────────────────┬────────────────────┘
                  │
┌─────────────────┴────────────────────┐
│        Append-Only Event Log         │
│            (Full History)            │
└──────────────────────────────────────┘
```
When to Choose Each:
| Choose Lambda | Choose Kappa |
|---|---|
| Complex batch-only features | Simpler feature logic |
| Need to reprocess with different logic | Logic won't change often |
| Streaming tech can't handle full history | Have mature streaming infrastructure |
| Different feature logic for batch vs. real-time | Same logic applies everywhere |
Most ML systems should start with batch-only pipelines. Streaming adds significant complexity and cost. Only add streaming when you have a clear business requirement for real-time features—not just because it seems more sophisticated. Many successful ML systems operate entirely in batch mode.
Feature engineering transforms raw data into the representations that ML models consume. The infrastructure supporting this transformation must balance expressiveness, performance, and maintainability.
Feature Computation Patterns:
| Pattern | Description | Example | Infrastructure Need |
|---|---|---|---|
| Point-in-time lookup | Single value at inference time | User's current city | Key-value store with fast reads |
| Aggregate over history | Aggregation over time window | Purchases in last 30 days | Pre-computed aggregates or streaming |
| Join across entities | Combine data from multiple sources | User features + item features | Wide tables or efficient joins |
| Embedding lookup | Dense vector representation | Word embedding, user embedding | Vector store optimized for lookups |
| Real-time computation | Calculated at request time | Distance from user to item | Low-latency compute in serving path |
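The last two rows of the table can be combined in a few lines: an embedding lookup from a low-latency store, followed by a request-time computation in the serving path. A minimal sketch in plain Python, where the stores and embedding values are invented for illustration:

```python
import math

# Hypothetical embedding stores: entity id -> dense vector (illustrative values)
user_embeddings = {"u1": [0.1, 0.3, 0.5]}
product_embeddings = {"p1": [0.2, 0.1, 0.4], "p2": [-0.5, 0.2, -0.1]}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Real-time feature: computed at request time in the serving path."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def user_product_affinity(user_id: str, product_id: str) -> float:
    # Embedding lookup (a point-in-time read from a key-value store)...
    u = user_embeddings[user_id]
    p = product_embeddings[product_id]
    # ...followed by a request-time computation.
    return cosine_similarity(u, p)

print(round(user_product_affinity("u1", "p1"), 3))  # 0.922
```

The lookup is cheap because the expensive work (training the embeddings) happened offline; only the final similarity is computed per request.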
The Feature Definition Language Challenge
As feature engineering complexity grows, teams need ways to define features declaratively rather than imperatively. A feature definition language lets a feature be specified once, with its entity, source, and aggregation window, and then reused consistently across training and serving.
Example: Declarative Feature Definition
```yaml
# Example feature definitions using a declarative approach
entities:
  - name: user
    join_key: user_id
    description: "Platform user"
  - name: product
    join_key: product_id
    description: "Product in the catalog"

features:
  - name: user_purchase_count_30d
    entity: user
    description: "Number of purchases in the last 30 days"
    aggregation:
      function: count
      source: purchases
      window: 30d
    value_type: int64
    tags: [behavioral, engagement]

  - name: user_avg_order_value_90d
    entity: user
    description: "Average order value in the last 90 days"
    aggregation:
      function: mean
      source: orders.total_amount
      window: 90d
    value_type: float64
    tags: [monetary, engagement]

  - name: product_embedding
    entity: product
    description: "Dense embedding from product2vec model"
    computation:
      function: lookup_embedding
      model: product2vec_v3
      dimension: 128
    value_type: float64_vector
    tags: [embedding, product]

  - name: user_product_affinity
    entities: [user, product]
    description: "Cosine similarity between user and product embeddings"
    computation:
      function: cosine_similarity
      args:
        - user.user_embedding
        - product.product_embedding
    value_type: float64
    online_compute: true  # Computed at serving time
    tags: [interaction, prediction]
```

Feature stores (like Feast, Tecton, or Amazon SageMaker Feature Store) implement these concepts as managed infrastructure. They handle storage, serving, versioning, and consistency guarantees—allowing ML engineers to focus on feature logic rather than infrastructure. We'll cover feature stores in detail in the Feature Engineering chapter.
Training-serving skew is one of the most pernicious problems in production ML. It occurs when the features used during model training differ—subtly or dramatically—from the features computed at serving time. The model was trained on one data distribution but serves predictions on another.
This is particularly dangerous because the system keeps producing predictions; they are simply, and silently, worse than they should be.
Detection Strategies
Proactive monitoring for training-serving skew requires multi-layered detection:
| Method | What It Catches | Implementation |
|---|---|---|
| Unit Tests | Code path differences | Test feature functions with identical inputs, compare outputs |
| Statistical Monitoring | Distribution drift | Compare feature distributions between training and serving |
| Prediction Logging | Model input differences | Log serving inputs, compare to training data statistics |
| Dual Path Validation | Implementation bugs | Run training and serving code on same inputs, compare |
| Shadow Mode Testing | End-to-end skew | Serve model predictions without acting on them, measure offline |
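The "Dual Path Validation" row can be made concrete with a small harness. This is an illustrative sketch: `training_features` and `serving_features` stand in for the two real implementations, and the tolerance is an assumed choice.

```python
# Sketch of dual-path validation: run both implementations on identical
# inputs and flag any feature whose values diverge beyond a tolerance.

def training_features(user: dict) -> dict:
    # Stand-in for the batch/training feature code
    return {"purchase_count": len(user["purchases"]),
            "avg_amount": sum(user["purchases"]) / len(user["purchases"])}

def serving_features(user: dict) -> dict:
    # Stand-in for the online/serving feature code (deliberate bug: rounding)
    return {"purchase_count": len(user["purchases"]),
            "avg_amount": round(sum(user["purchases"]) / len(user["purchases"]), 0)}

def find_skew(user: dict, tol: float = 1e-6) -> list[str]:
    train, serve = training_features(user), serving_features(user)
    return [name for name in train if abs(train[name] - serve[name]) > tol]

user = {"purchases": [10.0, 25.0, 14.5]}
print(find_skew(user))  # the rounded avg_amount diverges
```

Run on a sample of production traffic, a harness like this catches implementation drift between the two code paths before the model's metrics ever show it.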
Prevention Strategies
1. Unified Feature Logic

The most robust prevention is using identical code for training and serving:
```python
# Single feature definition used in both contexts
# (UserData and FeatureVector are the application's own types)
from datetime import date, timedelta
from statistics import mean

def compute_user_features(user_data: UserData) -> FeatureVector:
    today = date.today()
    return FeatureVector(
        purchase_count_30d=sum(
            1 for p in user_data.purchases if p.date > today - timedelta(days=30)
        ),
        avg_order_value_90d=mean(
            p.amount for p in user_data.purchases if p.date > today - timedelta(days=90)
        ),
        # ... more features
    )

# Training: compute_user_features(historical_user_data)
# Serving:  compute_user_features(current_user_data)
```
2. Feature Store Architecture

Use a feature store that ensures batch-computed features are served identically:
```
┌────────────────────────────────────────────────────────────────┐
│                         Feature Store                          │
│  ┌──────────────────────────┐     ┌──────────────────────────┐ │
│  │      Offline Store       │     │       Online Store       │ │
│  │  (Historical features)   │─────│  (Low-latency serving)   │ │
│  └────────────┬─────────────┘     └────────────┬─────────────┘ │
└───────────────┼────────────────────────────────┼───────────────┘
                │                                │
          Training Data                  Serving Requests
```
3. Point-in-Time Joins

For historical training data, ensure features are computed as they would have been at prediction time:
```sql
-- Correct: point-in-time join
SELECT
    prediction_request.user_id,
    prediction_request.timestamp,
    features.feature_value
FROM prediction_request
LEFT JOIN features
    ON prediction_request.user_id = features.user_id
    AND features.computed_at <= prediction_request.timestamp
    AND features.computed_at > prediction_request.timestamp - INTERVAL '1 day'
```

```sql
-- Incorrect: uses latest features (data leakage)
SELECT
    prediction_request.user_id,
    prediction_request.timestamp,
    features.feature_value  -- This includes future information!
FROM prediction_request
LEFT JOIN features
    ON prediction_request.user_id = features.user_id
```
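The same point-in-time rule can be expressed outside SQL. A small sketch with invented data: for each request, take the most recent feature value computed at or before the request timestamp, and no more than one day earlier.

```python
from datetime import datetime, timedelta

# Feature rows: (user_id, computed_at, value) -- illustrative data
feature_rows = [
    ("u1", datetime(2024, 5, 1), 10),
    ("u1", datetime(2024, 5, 2), 12),
    ("u1", datetime(2024, 5, 3), 15),
]

def point_in_time_value(user_id, ts, max_age=timedelta(days=1)):
    """Latest feature computed at or before ts, and no older than max_age."""
    candidates = [
        (computed_at, value)
        for uid, computed_at, value in feature_rows
        if uid == user_id and computed_at <= ts and computed_at > ts - max_age
    ]
    return max(candidates)[1] if candidates else None

# A request on May 2 at noon must see the May 2 value, never the future May 3 one
print(point_in_time_value("u1", datetime(2024, 5, 2, 12, 0)))  # 12
```

Using `max` over `(computed_at, value)` pairs mirrors the "latest row within the window" semantics of the correct SQL query above.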
Training-serving skew often causes models to degrade slowly over time, making it hard to attribute to a root cause. By the time you notice poor model performance, the skew may have accumulated through multiple sources. Invest in monitoring and prevention upfront—it's far cheaper than debugging in production.
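One common way to implement distribution monitoring is the Population Stability Index (PSI) between a training sample and a serving sample. A minimal sketch; the bin edges, the samples, and the 0.2 alert threshold are conventional but assumed choices:

```python
import math

def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population Stability Index between a training (expected) and a
    serving (actual) sample, computed over fixed bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-4) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

edges = [0, 50, 100, 150, 200]
training = [20, 30, 60, 70, 110, 120, 160, 170]  # illustrative samples
serving = [20, 25, 30, 35, 40, 45, 60, 110]      # shifted toward low values
print(psi(training, serving, edges) > 0.2)  # True: flag for investigation
```

A PSI near zero means the two distributions match; values above roughly 0.2 are commonly treated as significant drift worth investigating.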
Data quality issues propagate through the entire ML system. Bad data leads to bad features, which leads to bad models, which leads to bad predictions. Unlike traditional software where bad input might cause obvious errors, ML systems often silently produce degraded results from bad data.
The Data Validation Pyramid:
Effective data validation operates at multiple levels, each catching different types of issues:
```python
# Example: Great Expectations-style data validation
from datetime import datetime

from great_expectations import expect

# Schema validation
expect(df.columns).to_contain(["user_id", "timestamp", "amount", "category"])
expect(df["user_id"]).to_be_of_type("string")
expect(df["amount"]).to_be_of_type("float64")

# Semantic validation
expect(df["amount"]).to_be_between(0, 100000)
expect(df["category"]).to_be_in_set(["electronics", "clothing", "food", "other"])
expect(df["timestamp"]).to_be_parseable_as_datetime()
expect(df["timestamp"]).to_be_less_than(datetime.utcnow())

# Statistical validation
expect(df["amount"]).mean_to_be_between(50, 200)
expect(df["amount"]).standard_deviation_to_be_less_than(500)
expect(df["category"]).unique_value_count_to_be_between(3, 10)

# Null validation
expect(df["user_id"]).to_have_no_nulls()
expect(df["amount"]).null_ratio_to_be_less_than(0.01)

# Referential integrity
expect(df["user_id"]).values_to_be_in(valid_user_ids)
```

Expectation-Based Data Contracts
Data contracts formalize expectations between data producers and consumers. For ML pipelines, this means agreeing up front on schemas, value ranges, freshness, and null-rate guarantees, so that a producer-side change surfaces as a contract violation rather than as silent model degradation.
Handling Data Quality Issues:
| Issue Type | Detection | Response Options |
|---|---|---|
| Missing data | Null counts, completeness | Impute, exclude, fail pipeline |
| Out of range | Value boundaries | Cap, exclude, flag for review |
| Schema change | Schema comparison | Adapt, reject, alert |
| Distribution shift | Statistical tests | Alert, retrain model, investigate |
| Duplicate records | Deduplication checks | Deduplicate, flag source issue |
| Timestamp issues | Ordering, gaps | Fill gaps, reject, alert |
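A few of the response options above, sketched as a single cleaning pass. The thresholds, the default imputation value, and the record shape are illustrative assumptions:

```python
# Sketch: impute missing values, cap out-of-range values, drop duplicates.
def clean_records(records, amount_cap=100_000.0, default_amount=0.0):
    seen, cleaned = set(), []
    for rec in records:
        key = (rec["user_id"], rec["timestamp"])
        if key in seen:            # duplicate record -> drop, flag source issue
            continue
        seen.add(key)
        rec = dict(rec)
        if rec["amount"] is None:  # missing data -> impute
            rec["amount"] = default_amount
        rec["amount"] = min(max(rec["amount"], 0.0), amount_cap)  # cap range
        cleaned.append(rec)
    return cleaned

raw = [
    {"user_id": "u1", "timestamp": 1, "amount": 120.0},
    {"user_id": "u1", "timestamp": 1, "amount": 120.0},     # duplicate
    {"user_id": "u2", "timestamp": 2, "amount": None},      # missing
    {"user_id": "u3", "timestamp": 3, "amount": 250_000.0}, # out of range
]
print(len(clean_records(raw)))  # 3 records survive
```

Whether to repair, exclude, or fail outright is a per-field policy decision; the code only shows the mechanics.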
The Fail-Fast vs. Fail-Safe Trade-off:
The right choice depends on the cost of bad predictions versus the cost of no predictions. For a recommendation system, serving slightly degraded recommendations is better than serving nothing. For a fraud detection system, it might be better to fail and rely on fallback rules.
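The trade-off can be encoded directly at the serving boundary. An illustrative sketch in which `validate`, the model call, and the fallback score are all stand-ins:

```python
# Fail-fast vs. fail-safe at the serving boundary.
FALLBACK_SCORE = 0.5  # e.g., a neutral score from rules-based fallback logic

def validate(features: dict) -> bool:
    return features.get("amount") is not None and features["amount"] >= 0

def model_predict(features: dict) -> float:
    return min(features["amount"] / 1000.0, 1.0)  # stand-in for the real model

def predict_fail_fast(features: dict) -> float:
    if not validate(features):
        # Fraud-style: refusing to predict is safer than predicting badly
        raise ValueError("bad features; refusing to predict")
    return model_predict(features)

def predict_fail_safe(features: dict) -> float:
    if not validate(features):
        return FALLBACK_SCORE  # recommendation-style: degrade, don't stop
    return model_predict(features)

print(predict_fail_safe({"amount": None}))  # 0.5
```

The two wrappers share all validation and model code; only the failure policy differs, which keeps the choice easy to revisit per use case.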
Define explicit SLAs for data quality metrics—not just availability. For example: '99% of records must pass all validations, with null rate < 1% for critical fields.' Track these SLAs alongside traditional data pipeline SLAs (freshness, completeness, latency).
ML data pipelines consist of many interdependent tasks—data extraction, validation, transformation, feature computation, model training, and deployment. Orchestrating these tasks reliably, efficiently, and transparently is the job of pipeline orchestration systems.
Core Orchestration Concerns: scheduling, dependency management, retries and failure handling, resource allocation, and observability into what ran, when, and why.
DAG-Based Orchestration
Most orchestrators model pipelines as Directed Acyclic Graphs (DAGs), where nodes are tasks and edges are dependencies:
```
                ┌──────────────┐
                │ Extract Data │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
┌───────────────┐ ┌────────────┐ ┌───────────┐
│ Validate Data │ │ Clean Data │ │ Log Stats │
└───────┬───────┘ └─────┬──────┘ └───────────┘
        │               │
        └───────┬───────┘
                ▼
       ┌──────────────────┐
       │ Compute Features │
       └────────┬─────────┘
                │
     ┌──────────┼──────────┐
     ▼          ▼          ▼
┌─────────┐ ┌──────────┐ ┌─────────┐
│  Train  │ │ Validate │ │ Update  │
│  Model  │ │ Features │ │  Store  │
└────┬────┘ └──────────┘ └─────────┘
     │
     ▼
┌─────────┐
│ Deploy  │
└─────────┘
```
Popular Orchestration Tools:
| Tool | Strengths | Best For |
|---|---|---|
| Apache Airflow | Mature, extensive integrations, Python-native | Complex ETL, established data teams |
| Prefect | Modern, cloud-native, dynamic DAGs | Growing teams, hybrid cloud |
| Dagster | Type-safe, testable, asset-centric | ML-heavy workflows, data quality focus |
| Kubeflow Pipelines | Kubernetes-native, ML-specific | ML training pipelines on Kubernetes |
| AWS Step Functions | Serverless, AWS integration | AWS-centric, simple workflows |
| dbt | SQL-centric, declarative transformations | Analytics, feature engineering in SQL |
```python
# Example: Airflow DAG for ML training pipeline
# (validate_training_data, compute_training_features, train_model_fn,
#  evaluate_model_fn, and deploy_to_production are defined elsewhere)
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    'ml_training_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['ml', 'training'],
) as dag:

    # Wait for upstream data pipeline
    wait_for_data = ExternalTaskSensor(
        task_id='wait_for_data_pipeline',
        external_dag_id='data_ingestion',
        external_task_id='final_validation',
        mode='poke',
        timeout=3600,
    )

    # Validate training data
    validate_data = PythonOperator(
        task_id='validate_training_data',
        python_callable=validate_training_data,
    )

    # Compute features
    compute_features = PythonOperator(
        task_id='compute_features',
        python_callable=compute_training_features,
        pool='spark-pool',  # Resource pool for Spark jobs
    )

    # Train model
    train_model = PythonOperator(
        task_id='train_model',
        python_callable=train_model_fn,
        pool='gpu-pool',  # GPU resource pool
        execution_timeout=timedelta(hours=4),
    )

    # Evaluate model
    evaluate_model = PythonOperator(
        task_id='evaluate_model',
        python_callable=evaluate_model_fn,
    )

    # Deploy if evaluation passes
    deploy_model = PythonOperator(
        task_id='deploy_model',
        python_callable=deploy_to_production,
        trigger_rule='all_success',
    )

    # Define dependencies
    wait_for_data >> validate_data >> compute_features >> train_model >> evaluate_model >> deploy_model
```

Orchestration tools manage task execution and dependencies. Workflow tools often add higher-level abstractions like data lineage, versioning, and experiment tracking. For ML, you often need both: an orchestrator for reliable execution and an ML platform (MLflow, Weights & Biases, etc.) for experiment management.
Mature ML systems employ proven architectural patterns that balance scalability, reliability, and maintainability. Understanding these patterns enables informed design decisions.
Pattern 1: The Feature Platform
Centralize feature management to enable reuse and consistency:
```
┌─────────────────────────────────────────────────────────────┐
│                      Feature Platform                       │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐│
│  │ Feature Registry│ │ Feature Compute │ │ Feature Serving ││
│  │ - Definitions   │ │ - Batch ETL     │ │ - Online Store  ││
│  │ - Lineage       │ │ - Streaming     │ │ - Point Lookup  ││
│  │ - Versioning    │ │ - Validation    │ │ - Batch Export  ││
│  └─────────────────┘ └─────────────────┘ └─────────────────┘│
└───────────┬───────────────────┬───────────────────┬─────────┘
            │                   │                   │
       ┌────┴────┐         ┌────┴────┐         ┌────┴────┐
       │ Team A  │         │ Team B  │         │ Team C  │
       │ Model 1 │         │ Model 2 │         │ Model 3 │
       └─────────┘         └─────────┘         └─────────┘
```
Benefits: features are defined once and reused across teams and models, with shared lineage, versioning, and consistent definitions.
Pattern 2: Event-Driven Feature Updates
Update features in response to business events rather than on fixed schedules:
```
  Events (purchases, clicks, etc.)
                 │
                 ▼
  ┌──────────────────────────────┐
  │         Event Stream         │
  │    (Kafka, Kinesis, etc.)    │
  └──────────────┬───────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌────────────────┐  ┌───────────────┐
│     Stream     │  │   Analytics   │
│   Processor    │  │   Consumer    │
│  (Flink, etc.) │  │               │
└───────┬────────┘  └───────────────┘
        │
        ▼
┌────────────────┐
│ Feature Store  │
│  (immediate)   │
└────────────────┘
```
Benefits: features update as soon as the underlying events occur, giving lower staleness than any fixed batch schedule.
Pattern 3: Offline-Online Sync Pattern
Maintain consistency between offline (training) and online (serving) data stores:
```
                    ┌────────────────────┐
                    │ Batch Feature Job  │
                    └─────────┬──────────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
             ▼                │                ▼
    ┌─────────────────┐       │       ┌─────────────────┐
    │  Offline Store  │◄──────┴──────►│  Online Store   │
    │   (Data Lake)   │  Sync Process │     (Redis)     │
    └────────┬────────┘               └────────┬────────┘
             │                                 │
             ▼                                 ▼
       Training Data                   Serving Requests
```
The sync process ensures that the values served online match the values the offline store would produce for training, so every model sees one consistent feature view.
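A minimal sketch of such a sync job, with invented rows and a plain dict standing in for the online store (a real system would write to Redis or a similar key-value store):

```python
# Sketch: push the latest batch-computed feature values into the online store.
offline_rows = [  # output of the batch feature job (illustrative data)
    {"user_id": "u1", "computed_at": 1, "purchase_count_30d": 4},
    {"user_id": "u1", "computed_at": 2, "purchase_count_30d": 5},
    {"user_id": "u2", "computed_at": 2, "purchase_count_30d": 1},
]

online_store: dict[str, dict] = {}  # stand-in for Redis

def sync_offline_to_online(rows, store):
    """Keep only the most recent row per entity and overwrite the online copy,
    so serving reads match what the offline store holds for training."""
    latest: dict[str, dict] = {}
    for row in rows:
        uid = row["user_id"]
        if uid not in latest or row["computed_at"] > latest[uid]["computed_at"]:
            latest[uid] = row
    for uid, row in latest.items():
        store[uid] = {k: v for k, v in row.items() if k != "user_id"}

sync_offline_to_online(offline_rows, online_store)
print(online_store["u1"]["purchase_count_30d"])  # 5 (the latest value)
```

Keeping the `computed_at` timestamp alongside the value also lets serving-side monitors detect stale features when a sync run fails.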
These patterns aren't mutually exclusive. Production systems often combine them: a Feature Platform as the foundation, Event-Driven Updates for real-time features, and Offline-Online Sync for consistency. Start with the simplest pattern that meets requirements and add complexity as proven necessary.
Data pipelines are the foundation upon which ML systems are built. Without robust, scalable, and consistent data infrastructure, even the most sophisticated models fail in production. The investment in pipeline design pays dividends throughout the system lifecycle.
What's next:
With data pipelines feeding reliable features, the next step is designing the model architecture itself. The next page explores Model Architecture—how to design, select, and structure the models that transform features into predictions, including considerations for complexity, interpretability, and production constraints.
You now understand how to design data pipelines for ML systems—from ingestion through serving, with strategies for batch and streaming processing, training-serving consistency, data quality, and orchestration. Next, we design the models that consume this data infrastructure.