Design a machine learning feature store — the central infrastructure for defining, computing, storing, and serving ML features across an organisation. The system maintains a dual-store architecture: an offline store (S3/Delta Lake) for historical features with point-in-time correct retrieval for model training, and an online store (Redis/DynamoDB) for low-latency (< 10ms) feature serving during real-time inference. Features are defined once via a DSL and materialised to both stores (batch via Spark, streaming via Flink), ensuring training-serving consistency. The system includes a feature registry for discovery, point-in-time join engine for training data generation, feature monitoring (drift, freshness, quality), and lineage tracking.
| Metric | Value |
|---|---|
| Total features registered | 10,000+ |
| ML models served | 1,000+ |
| Online serving requests/sec | 100,000+ |
| Online serving latency (p99) | < 10ms |
| Entities in online store | 100 million+ |
| Features per entity | 10–100 |
| Offline store size | 10–100 TB |
| Batch materialisation frequency | Hourly / daily |
| Streaming materialisation latency | Seconds |
| Teams sharing features | 50+ |
**Feature registry and discovery:** central catalog of all ML features across the organisation; each feature is registered with metadata — name, description, data type, owner team, source (raw data table/stream), computation logic, freshness SLA, and entity (user_id, item_id, order_id); data scientists search and discover existing features before creating new ones
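The registry described above can be sketched as a small catalog keyed by feature name. This is a minimal in-memory illustration, not any particular feature-store API; the `FeatureDefinition` fields mirror the metadata listed in the requirement, and the names (`FeatureRegistry`, `search`) are assumptions.

```python
from dataclasses import dataclass

# Hypothetical registry sketch: each feature carries the metadata listed
# above (name, type, owner, source, freshness SLA, entity).
@dataclass
class FeatureDefinition:
    name: str
    entity: str               # e.g. "user_id"
    dtype: str                # e.g. "int64", "float32"
    owner_team: str
    source: str               # raw table or stream the feature is computed from
    freshness_sla_seconds: int
    description: str = ""

class FeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, feature: FeatureDefinition):
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name} already registered")
        self._features[feature.name] = feature

    def search(self, keyword: str):
        # Naive discovery: substring match on name or description.
        return [f for f in self._features.values()
                if keyword in f.name or keyword in f.description]

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="user_lifetime_purchases", entity="user_id", dtype="int64",
    owner_team="growth", source="warehouse.orders",
    freshness_sla_seconds=86_400,
    description="Total completed purchases for a user",
))
```

A production registry would back this with a database and expose full-text search, but the register-then-discover flow is the same.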
**Offline feature store (batch):** store historical feature values for model training; features computed via batch pipelines (Spark/BigQuery) from the data warehouse; point-in-time correct retrieval — for a training example at time T, return feature values as they existed at time T (preventing data leakage / lookahead bias); store in a columnar format (Parquet/Delta Lake) for efficient training data generation
**Online feature store (serving):** serve feature values at low latency (< 10ms p99) for real-time model inference; when a model receives a prediction request → fetch features for the entity (e.g., user_id=123) from the online store → feed to model; backed by a low-latency key-value store (Redis, DynamoDB, Bigtable)
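The online read path is essentially one multi-get against a key-value store. A sketch, with a plain dict standing in for Redis/DynamoDB; the `entity_type:entity_id:feature_name` key layout is an assumption, not a prescribed schema.

```python
# In-memory stand-in for the online KV store (Redis/DynamoDB/Bigtable).
online_store = {
    "user:123:user_lifetime_purchases": 42,
    "user:123:user_last_click_item": "item_987",
}

def get_online_features(entity_id: str, feature_names: list) -> dict:
    # In a real store this would be a single batched read
    # (e.g. Redis MGET or DynamoDB BatchGetItem) to keep p99 low.
    return {name: online_store.get(f"user:{entity_id}:{name}")
            for name in feature_names}

features = get_online_features(
    "123", ["user_lifetime_purchases", "user_last_click_item"])
```

Missing features come back as `None`, which the serving layer typically maps to a model-specific default.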
**Feature materialisation:** populate the online store from batch or streaming sources; batch materialisation: a scheduled Spark job reads features from the offline store → writes to the online KV store (periodic refresh, e.g., hourly/daily); stream materialisation: a Flink/Spark Streaming job computes features from event streams → writes to the online store in real time (seconds-fresh)
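The core of batch materialisation is "latest value per entity wins": dedupe the offline table by timestamp, then upsert into the KV store. A toy sketch with lists and dicts standing in for the Parquet table and Redis; in practice this is a Spark job over the real stores.

```python
# Offline rows: one record per (entity, feature, timestamp).
offline_rows = [
    {"user_id": "123", "feature": "user_lifetime_purchases", "value": 41, "ts": 100},
    {"user_id": "123", "feature": "user_lifetime_purchases", "value": 42, "ts": 200},
    {"user_id": "456", "feature": "user_lifetime_purchases", "value": 7,  "ts": 150},
]

def materialise(rows, kv):
    # Keep only the latest row per (entity, feature)...
    latest = {}
    for row in rows:
        key = (row["user_id"], row["feature"])
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    # ...then upsert into the online store.
    for (user_id, feature), row in latest.items():
        kv[f"user:{user_id}:{feature}"] = row["value"]

kv = {}
materialise(offline_rows, kv)
```

Streaming materialisation follows the same upsert shape, just driven by events as they arrive rather than by a scheduled scan.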
**Training-serving consistency:** the features used during model training MUST match the features served during inference (training-serving skew is a top cause of ML system failures); the same feature definition produces values for both offline training and online serving; transformations defined once, executed in both contexts
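The "define once, execute in both contexts" idea can be shown concretely: a single pure transformation function invoked by both the batch pipeline and the serving path, so the two cannot drift apart. The function name and fields are illustrative.

```python
# One transformation, shared by both contexts. Because both paths call the
# same code, there is no separate re-implementation to drift out of sync.
def days_since_last_purchase(last_purchase_ts: int, now_ts: int) -> float:
    return (now_ts - last_purchase_ts) / 86_400

# Batch/training context: applied over historical rows, with "now" being
# each training example's event time (point-in-time correct).
training_rows = [{"last_purchase_ts": 0, "event_ts": 172_800}]
training_values = [days_since_last_purchase(r["last_purchase_ts"], r["event_ts"])
                   for r in training_rows]

# Online/serving context: applied to a single request at inference time.
serving_value = days_since_last_purchase(0, 172_800)

assert training_values[0] == serving_value == 2.0
```

Skew typically creeps in when the training transformation lives in a SQL pipeline and the serving one is re-written in application code; sharing one definition removes that failure mode.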
**Point-in-time joins:** when generating training datasets, join features to training examples at the correct timestamp — for each (entity_id, event_time) in the training set, retrieve the latest feature value that was available BEFORE event_time; prevents future data from leaking into training; critical for temporal correctness
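Per entity, a point-in-time lookup is "rightmost feature timestamp strictly before event_time" over a time-sorted history — a binary search. A minimal sketch with in-memory data (the offline engine does this as a distributed as-of join):

```python
import bisect

# entity -> time-sorted list of (feature_ts, feature_value).
feature_history = {
    "user_123": [(100, 1), (200, 2), (300, 3)],
}

def point_in_time_lookup(entity: str, event_time: int):
    history = feature_history.get(entity, [])
    ts_list = [ts for ts, _ in history]
    # Rightmost feature timestamp strictly before event_time. Returning
    # None for "no value yet" — never a future value — is exactly what
    # prevents lookahead bias in the training set.
    idx = bisect.bisect_left(ts_list, event_time) - 1
    return history[idx][1] if idx >= 0 else None

assert point_in_time_lookup("user_123", 250) == 2   # sees ts=200, not ts=300
assert point_in_time_lookup("user_123", 100) is None  # ts=100 is not BEFORE 100
```

At scale the same semantics are expressed as an as-of join (e.g. sort-merge per entity partition) rather than per-row lookups.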
**Feature transformations:** support defining transformations on raw data — aggregations (count, sum, avg over sliding windows), encodings (one-hot, embedding lookups), derived features (ratio of two features), time-based (days since last purchase, time of day); transformations versioned and tracked
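Sliding-window aggregations are the workhorse transformation (e.g. "clicks in the last 7 days"). A sketch of the streaming form, maintaining the window incrementally per event; class and parameter names are illustrative, and timestamps are assumed to arrive in order.

```python
from collections import deque

class SlidingWindowCount:
    """Count of events in the trailing window, updated per event."""
    def __init__(self, window: int):
        self.window = window      # window length, same units as event timestamps
        self.events = deque()     # event timestamps, oldest first

    def add(self, ts: int) -> int:
        self.events.append(ts)
        # Evict events that have fallen out of the trailing window.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)

clicks_7d = SlidingWindowCount(window=7 * 24 * 3600)
```

A real streaming engine (Flink) adds out-of-order handling via watermarks, but the evict-and-count core is the same.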
**Feature freshness and SLAs:** different features have different freshness requirements — user_lifetime_purchases: daily refresh OK; user_last_click_item: must be real-time (< 1 minute); define freshness SLAs per feature; monitoring alerts when an SLA is violated (feature staleness exceeds its threshold)
**Feature versioning and lineage:** track feature versions (v1, v2 with changed computation logic); lineage: trace from a feature back to its raw data sources and transformation pipelines; when a raw data source changes → identify all downstream features affected; essential for debugging, auditing, and compliance
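Lineage is a dependency graph, and "what does this source change affect?" is a downstream traversal. A sketch over an adjacency map; the node names are illustrative.

```python
# source/feature -> things computed from it. "user_purchase_rate" is an
# assumed derived feature, included to show transitive impact.
edges = {
    "warehouse.orders": ["user_lifetime_purchases"],
    "user_lifetime_purchases": ["user_purchase_rate"],
}

def downstream(node: str) -> set:
    # Depth-first traversal collecting everything transitively affected.
    affected, stack = set(), [node]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected
```

The same graph walked in the reverse direction answers the auditing question ("which raw sources feed this feature?"); a real registry stores both directions.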
**Access control and sharing:** features owned by teams; fine-grained access control (team A can read features owned by team B but not modify them); feature sharing across teams reduces duplication (e.g., 'user_age' used by 10+ models across 5 teams); usage tracking (which models use which features)
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Online feature serving under 10ms p99' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?