Design a machine learning feature store — the central infrastructure for defining, computing, storing, and serving ML features across an organisation. The system maintains a dual-store architecture: an offline store (S3/Delta Lake) for historical features with point-in-time correct retrieval for model training, and an online store (Redis/DynamoDB) for low-latency (< 10ms) feature serving during real-time inference. Features are defined once via a DSL and materialised to both stores (batch via Spark, streaming via Flink), ensuring training-serving consistency. The system includes a feature registry for discovery, point-in-time join engine for training data generation, feature monitoring (drift, freshness, quality), and lineage tracking.
| Metric | Value |
|---|---|
| Total features registered | 10,000+ |
| ML models served | 1,000+ |
| Online serving requests/sec | 100,000+ |
| Online serving latency (p99) | < 10ms |
| Entities in online store | 100 million+ |
| Features per entity | 10–100 |
| Offline store size | 10–100 TB |
| Batch materialisation frequency | Hourly / daily |
| Streaming materialisation latency | Seconds |
| Teams sharing features | 50+ |
**Feature registry and discovery:** central catalog of all ML features across the organisation; each feature is registered with metadata — name, description, data type, owner team, source (raw data table/stream), computation logic, freshness SLA, and entity (user_id, item_id, order_id); data scientists search and discover existing features before creating new ones
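The registry described above can be sketched as a small catalog keyed by feature name. This is a minimal in-memory illustration, not any particular feature-store API; the `FeatureDefinition` fields mirror the metadata listed in the requirement, and the names (`FeatureRegistry`, `search`) are assumptions.

```python
from dataclasses import dataclass

# Hypothetical registry sketch: each feature carries the metadata listed
# above (name, type, owner, source, freshness SLA, entity).
@dataclass
class FeatureDefinition:
    name: str
    entity: str               # e.g. "user_id"
    dtype: str                # e.g. "int64", "float32"
    owner_team: str
    source: str               # raw table or stream the feature is computed from
    freshness_sla_seconds: int
    description: str = ""

class FeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, feature: FeatureDefinition):
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name} already registered")
        self._features[feature.name] = feature

    def search(self, keyword: str):
        # Naive discovery: substring match on name or description.
        return [f for f in self._features.values()
                if keyword in f.name or keyword in f.description]

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="user_lifetime_purchases", entity="user_id", dtype="int64",
    owner_team="growth", source="warehouse.orders",
    freshness_sla_seconds=86_400,
    description="Total completed purchases for a user",
))
```

A production registry would back this with a database and expose full-text search, but the register-then-discover flow is the same.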
**Offline feature store (batch):** store historical feature values for model training; features computed via batch pipelines (Spark/BigQuery) from the data warehouse; point-in-time correct retrieval — for a training example at time T, return feature values as they existed at time T (preventing data leakage / lookahead bias); store in a columnar format (Parquet/Delta Lake) for efficient training data generation
**Online feature store (serving):** serve feature values at low latency (< 10ms p99) for real-time model inference; when a model receives a prediction request → fetch features for the entity (e.g., user_id=123) from the online store → feed to model; backed by a low-latency key-value store (Redis, DynamoDB, Bigtable)
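The online read path is essentially one multi-get against a key-value store. A sketch, with a plain dict standing in for Redis/DynamoDB; the `entity_type:entity_id:feature_name` key layout is an assumption, not a prescribed schema.

```python
# In-memory stand-in for the online KV store (Redis/DynamoDB/Bigtable).
online_store = {
    "user:123:user_lifetime_purchases": 42,
    "user:123:user_last_click_item": "item_987",
}

def get_online_features(entity_id: str, feature_names: list) -> dict:
    # In a real store this would be a single batched read
    # (e.g. Redis MGET or DynamoDB BatchGetItem) to keep p99 low.
    return {name: online_store.get(f"user:{entity_id}:{name}")
            for name in feature_names}

features = get_online_features(
    "123", ["user_lifetime_purchases", "user_last_click_item"])
```

Missing features come back as `None`, which the serving layer typically maps to a model-specific default.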
**Feature materialisation:** populate the online store from batch or streaming sources; batch materialisation: a scheduled Spark job reads features from the offline store → writes to the online KV store (periodic refresh, e.g., hourly/daily); stream materialisation: a Flink/Spark Streaming job computes features from event streams → writes to the online store in real time (seconds-fresh)
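The core of batch materialisation is "latest value per entity wins": dedupe the offline table by timestamp, then upsert into the KV store. A toy sketch with lists and dicts standing in for the Parquet table and Redis; in practice this is a Spark job over the real stores.

```python
# Offline rows: one record per (entity, feature, timestamp).
offline_rows = [
    {"user_id": "123", "feature": "user_lifetime_purchases", "value": 41, "ts": 100},
    {"user_id": "123", "feature": "user_lifetime_purchases", "value": 42, "ts": 200},
    {"user_id": "456", "feature": "user_lifetime_purchases", "value": 7,  "ts": 150},
]

def materialise(rows, kv):
    # Keep only the latest row per (entity, feature)...
    latest = {}
    for row in rows:
        key = (row["user_id"], row["feature"])
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    # ...then upsert into the online store.
    for (user_id, feature), row in latest.items():
        kv[f"user:{user_id}:{feature}"] = row["value"]

kv = {}
materialise(offline_rows, kv)
```

Streaming materialisation follows the same upsert shape, just driven by events as they arrive rather than by a scheduled scan.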
**Training-serving consistency:** the features used during model training MUST match the features served during inference (training-serving skew is a top cause of ML system failures); the same feature definition produces values for both offline training and online serving; transformations defined once, executed in both contexts
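The "define once, execute in both contexts" idea can be shown concretely: a single pure transformation function invoked by both the batch pipeline and the serving path, so the two cannot drift apart. The function name and fields are illustrative.

```python
# One transformation, shared by both contexts. Because both paths call the
# same code, there is no separate re-implementation to drift out of sync.
def days_since_last_purchase(last_purchase_ts: int, now_ts: int) -> float:
    return (now_ts - last_purchase_ts) / 86_400

# Batch/training context: applied over historical rows, with "now" being
# each training example's event time (point-in-time correct).
training_rows = [{"last_purchase_ts": 0, "event_ts": 172_800}]
training_values = [days_since_last_purchase(r["last_purchase_ts"], r["event_ts"])
                   for r in training_rows]

# Online/serving context: applied to a single request at inference time.
serving_value = days_since_last_purchase(0, 172_800)

assert training_values[0] == serving_value == 2.0
```

Skew typically creeps in when the training transformation lives in a SQL pipeline and the serving one is re-written in application code; sharing one definition removes that failure mode.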
**Point-in-time joins:** when generating training datasets, join features to training examples at the correct timestamp — for each (entity_id, event_time) in the training set, retrieve the latest feature value that was available BEFORE event_time; prevents future data from leaking into training; critical for temporal correctness
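Per entity, a point-in-time lookup is "rightmost feature timestamp strictly before event_time" over a time-sorted history — a binary search. A minimal sketch with in-memory data (the offline engine does this as a distributed as-of join):

```python
import bisect

# entity -> time-sorted list of (feature_ts, feature_value).
feature_history = {
    "user_123": [(100, 1), (200, 2), (300, 3)],
}

def point_in_time_lookup(entity: str, event_time: int):
    history = feature_history.get(entity, [])
    ts_list = [ts for ts, _ in history]
    # Rightmost feature timestamp strictly before event_time. Returning
    # None for "no value yet" — never a future value — is exactly what
    # prevents lookahead bias in the training set.
    idx = bisect.bisect_left(ts_list, event_time) - 1
    return history[idx][1] if idx >= 0 else None

assert point_in_time_lookup("user_123", 250) == 2   # sees ts=200, not ts=300
assert point_in_time_lookup("user_123", 100) is None  # ts=100 is not BEFORE 100
```

At scale the same semantics are expressed as an as-of join (e.g. sort-merge per entity partition) rather than per-row lookups.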
**Feature transformations:** support defining transformations on raw data — aggregations (count, sum, avg over sliding windows), encodings (one-hot, embedding lookups), derived features (ratio of two features), time-based (days since last purchase, time of day); transformations versioned and tracked
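Sliding-window aggregations are the workhorse transformation (e.g. "clicks in the last 7 days"). A sketch of the streaming form, maintaining the window incrementally per event; class and parameter names are illustrative, and timestamps are assumed to arrive in order.

```python
from collections import deque

class SlidingWindowCount:
    """Count of events in the trailing window, updated per event."""
    def __init__(self, window: int):
        self.window = window      # window length, same units as event timestamps
        self.events = deque()     # event timestamps, oldest first

    def add(self, ts: int) -> int:
        self.events.append(ts)
        # Evict events that have fallen out of the trailing window.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)

clicks_7d = SlidingWindowCount(window=7 * 24 * 3600)
```

A real streaming engine (Flink) adds out-of-order handling via watermarks, but the evict-and-count core is the same.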
**Feature freshness and SLAs:** different features have different freshness requirements — user_lifetime_purchases: daily refresh OK; user_last_click_item: must be real-time (< 1 minute); define freshness SLAs per feature; monitoring alerts when an SLA is violated (feature staleness exceeds its threshold)
**Feature versioning and lineage:** track feature versions (v1, v2 with changed computation logic); lineage: trace from a feature back to its raw data sources and transformation pipelines; when a raw data source changes → identify all downstream features affected; essential for debugging, auditing, and compliance
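Lineage is a dependency graph, and "what does this source change affect?" is a downstream traversal. A sketch over an adjacency map; the node names are illustrative.

```python
# source/feature -> things computed from it. "user_purchase_rate" is an
# assumed derived feature, included to show transitive impact.
edges = {
    "warehouse.orders": ["user_lifetime_purchases"],
    "user_lifetime_purchases": ["user_purchase_rate"],
}

def downstream(node: str) -> set:
    # Depth-first traversal collecting everything transitively affected.
    affected, stack = set(), [node]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected
```

The same graph walked in the reverse direction answers the auditing question ("which raw sources feed this feature?"); a real registry stores both directions.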
**Access control and sharing:** features owned by teams; fine-grained access control (team A can read features owned by team B but not modify them); feature sharing across teams reduces duplication (e.g., 'user_age' used by 10+ models across 5 teams); usage tracking (which models use which features)
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Online feature serving under 10ms p99' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?