The model is where the magic happens—where features become predictions, where data transforms into decisions. But designing model architecture for production is not the same as designing models for research papers or Kaggle competitions.
Production model architecture must balance competing concerns: accuracy (how well does it predict?), latency (how fast can it respond?), interpretability (can we explain its decisions?), maintainability (can we debug and improve it?), and resource efficiency (what does it cost to run?). No single architecture optimizes all of these simultaneously.
This page covers the principles, patterns, and practical considerations for designing model architectures that thrive in production—not just on validation sets.
By the end of this page, you will understand how to design model architectures for production ML systems—selecting appropriate model families, structuring multi-stage pipelines, balancing complexity and constraints, and designing for continuous improvement.
Choosing the right model family is the most consequential architectural decision. This choice determines the complexity ceiling, interpretability floor, and operational characteristics of your system.
The Model Selection Decision Tree:
| Constraint | Favors Simple Models | Favors Complex Models |
|---|---|---|
| Latency requirement | < 10ms inference | > 100ms acceptable |
| Training data size | < 10K samples | > 100K samples |
| Feature engineering effort | Limited, automated | Extensive, expert-driven |
| Interpretability requirement | Must explain every prediction | Aggregate explanations sufficient |
| Maintenance capacity | Small team, limited ML expertise | Dedicated ML team |
| Error tolerance | False positives very costly | Aggregate accuracy sufficient |
| Deployment environment | Edge, mobile, embedded | Cloud, GPU clusters |
The Model Complexity Ladder:
Start simple and add complexity only when necessary:
Level 1: Heuristics and Rules
Level 2: Linear Models
Level 3: Tree-Based Ensembles
Level 4: Neural Networks
Level 5: Large Pre-trained Models
Always start with the simplest reasonable baseline. A logistic regression or gradient boosting model trained in a day often captures 80% of possible improvement. Use this baseline to measure the value of additional complexity. 'This deep learning model beat our XGBoost baseline by 2%' is a reasonable statement. 'This deep learning model achieves 85% accuracy' doesn't tell you if the problem is trivial or impressive.
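As a concrete illustration, here is a minimal baseline sketch in scikit-learn, assuming a prepared tabular dataset `X, y`; the specific models and the AUC metric are illustrative choices, not a prescription.

```python
# A minimal baseline sketch (assumes a tabular dataset X, y is already prepared).
# The goal is a cheap reference score that any more complex model must beat.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def baseline_report(X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    baselines = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "gradient_boosting": GradientBoostingClassifier(),
    }
    scores = {}
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return scores  # future models must justify their cost against these numbers
```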
Production ML systems often decompose complex predictions into multiple stages, each optimized for different objectives. This multi-stage architecture enables better performance, scalability, and maintainability than monolithic models.
The Retrieval-Ranking Pattern:
The most common multi-stage pattern in recommendation and search systems:
┌─────────────────────────────────────────────────────────────────┐
│ Candidate Generation │
│ (Retrieve ~1000 candidates from millions) │
│ Fast, high recall, lower precision │
│ (Embedding similarity, inverted index) │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Ranking Stage │
│ (Score ~1000 candidates, select top ~100) │
│ More complex model, better relevance │
│ (Neural ranker, gradient boosting) │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Re-ranking Stage │
│ (Optimize for business objectives) │
│ Diversity, freshness, fairness constraints │
│ (Rule-based + learned combination) │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
Final ranked list (10-20 items)
Why Multi-Stage Works:
| Stage | Optimization Goal | Latency Budget | Typical Models |
|---|---|---|---|
| Candidate Generation | High recall, fast | < 20ms for 1000 candidates | ANN, embedding retrieval |
| Initial Ranking | Relevance scoring | < 50ms for 100-1000 items | GBDT, simple neural nets |
| Fine Ranking | Precise relevance | < 100ms for 50-100 items | Deep cross networks, BERT |
| Re-ranking | Business objectives | < 10ms for final ordering | Rules, optimization layers |
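To make the stage boundaries concrete, here is a minimal retrieval, ranking, and re-ranking sketch in NumPy. The `rank_model.predict_score` interface, the category-diversity rule, and the candidate counts are illustrative assumptions, not a fixed design.

```python
# A minimal retrieval -> ranking -> re-ranking sketch using plain NumPy.
# item_embeddings, rank_model, and the diversity rule are illustrative assumptions.
import numpy as np

def recommend(user_vec, item_embeddings, item_categories, rank_model,
              k_retrieve=1000, k_final=20):
    # Stage 1: candidate generation - fast, high-recall relevance via dot product.
    scores = item_embeddings @ user_vec
    candidates = np.argsort(-scores)[:k_retrieve]

    # Stage 2: ranking - a more expensive model scores only the candidates.
    ranked = sorted(candidates,
                    key=lambda i: -rank_model.predict_score(user_vec, item_embeddings[i]))

    # Stage 3: re-ranking - apply a simple business rule (category diversity).
    final, seen_categories = [], set()
    for i in ranked:
        if item_categories[i] in seen_categories and len(seen_categories) < 5:
            continue  # skip repeats until we have some category spread
        final.append(i)
        seen_categories.add(item_categories[i])
        if len(final) == k_final:
            break
    return final
```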
The Cascade Pattern:
For classification problems, cascade models apply increasingly expensive checks:
Input → [Fast Filter] → 95% rejected → Exit (negative prediction)
│
▼ 5% pass
[Medium Model] → 80% rejected → Exit (negative prediction)
│
▼ 1% pass
[Heavy Model] → Final prediction with high confidence
Example: Spam Detection Cascade
| Stage | Model | Latency | Purpose |
|---|---|---|---|
| 1 | Blacklist lookup | <1ms | Catch known bad actors |
| 2 | Simple regex rules | <5ms | Catch obvious patterns |
| 3 | Fast classifier (logistic) | <10ms | Quick content scoring |
| 4 | Deep NLP model | <100ms | Sophisticated content analysis |
Most requests exit early, reserving expensive computation for ambiguous cases.
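A sketch of the cascade above: the `msg` object, the `fast_clf` and `deep_nlp_model` interfaces, the blacklist contents, and the 0.05/0.95 confidence thresholds are all illustrative assumptions.

```python
# A cascade sketch following the spam-detection table above.
import re

BLACKLIST = {"known-spammer@example.com"}          # illustrative entry
OBVIOUS_SPAM = re.compile(r"(free money|click here now|act now!!!)", re.IGNORECASE)

def classify_message(msg, fast_clf, deep_nlp_model):
    # Stage 1: blacklist lookup (<1ms) - catch known bad actors.
    if msg.sender in BLACKLIST:
        return "spam"
    # Stage 2: regex rules (<5ms) - catch obvious patterns.
    if OBVIOUS_SPAM.search(msg.text):
        return "spam"
    # Stage 3: fast classifier (<10ms) - exit early on confident scores.
    p = fast_clf.predict_proba([msg.text])[0][1]
    if p < 0.05:
        return "ham"
    if p > 0.95:
        return "spam"
    # Stage 4: heavy NLP model (<100ms) - only ambiguous cases reach here.
    return "spam" if deep_nlp_model.score(msg.text) > 0.5 else "ham"
```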
The Ensemble Pattern:
Combine multiple models to improve robustness:
Input ─┬─→ [Model A (Tree-based)] ──┐
       ├─→ [Model B (Neural)]     ──┼─→ [Combiner] → Output
       └─→ [Model C (Linear)]     ──┘
Combination Strategies: simple or weighted averaging of scores, majority or soft voting for classification, and stacking, where a learned combiner takes the base models' outputs as features. A minimal weighted-average combiner is sketched below.
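The weights here are assumptions; in practice they would be tuned on a validation set or replaced by a learned stacking model.

```python
# A minimal weighted-average combiner over three base models' predictions.
import numpy as np

def combine(preds_tree, preds_neural, preds_linear, weights=(0.5, 0.3, 0.2)):
    stacked = np.stack([preds_tree, preds_neural, preds_linear])  # shape (3, n)
    return np.average(stacked, axis=0, weights=weights)           # per-example blend
```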
Multi-stage pipelines can become unmaintainable if stages proliferate without discipline. Each stage adds: potential failure points, latency, maintenance burden, and debugging complexity. Add stages only when the performance or operational benefits are clear and significant.
How features enter and flow through the model is a critical architectural decision. The feature architecture determines what information the model can access, how efficiently it processes it, and how the model can be updated and debugged.
Feature Categories and Processing:
| Feature Type | Examples | Processing Approach | Architectural Notes |
|---|---|---|---|
| Dense Numeric | Age, price, ratings | Normalization → Direct input | Scale matters; use Z-score or min-max |
| Sparse Categorical | User ID, product category | Embedding lookup | Embedding dimensions scale with cardinality |
| Multi-hot Categorical | Tags, genres | Pooled embeddings (sum/avg) | Order doesn't matter; pooling captures this |
| Sequential | Click history, viewed items | Sequence embeddings (RNN, Transformer) | Order matters; attention over sequence |
| Text | Product descriptions, reviews | Text encoder (TF-IDF → learned) | Pre-trained encoders often best |
| Cross Features | User-item interactions | Explicit crosses or learned | Crucial for recommendation systems |
The Embedding Architecture:
For high-cardinality categorical features (user IDs, product IDs), embeddings are essential:
┌─────────────────────────────────────────────────────────────────┐
│ Embedding Layer │
│ │
│ User ID (10M users)→ [Embedding: 10M × 64] → 64-dim vector │
│ Item ID (1M items) → [Embedding: 1M × 32] → 32-dim vector │
│ Category (100 cats)→ [Embedding: 100 × 8] → 8-dim vector │
│ │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Embedding Combination │
│ [Concatenate] or [Dot Product] or [Attention] │
└────────────────────────────┬────────────────────────────────────┘
│
▼
Dense Layers
Embedding Size Guidelines:
| Cardinality | Recommended Dimension | Reasoning |
|---|---|---|
| < 100 | 4-8 | Small vocabulary, limited expressiveness needed |
| 100 - 10K | 16-32 | Moderate vocabulary, balance expressiveness/cost |
| 10K - 1M | 32-64 | Large vocabulary, need richer representations |
| > 1M | 64-128 | Very large; consider hashing or hierarchical |
Rule of Thumb: the fourth root of cardinality works surprisingly well: dimension ≈ cardinality^(1/4), clipped to a sensible range.
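Expressed as a small helper (the clipping bounds are illustrative):

```python
# Fourth-root rule of thumb for embedding dimension, clipped to [lo, hi].
def embedding_dim(cardinality: int, lo: int = 4, hi: int = 128) -> int:
    return int(min(hi, max(lo, round(cardinality ** 0.25))))

# e.g. embedding_dim(100) -> 4, embedding_dim(1_000_000) -> 32
```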
The Two-Tower Architecture:
For recommendation and matching problems, separate encoders for queries and items:
User Features Item Features
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ User Tower │ │ Item Tower │
│ (DNN layers) │ │ (DNN layers) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
User Embedding Item Embedding
(64-d) (64-d)
│ │
└──────────┬───────────────┘
│
▼
Dot Product → Score
Benefits: (1) item embeddings can be precomputed offline and indexed for fast approximate nearest-neighbor retrieval, (2) the two towers can be retrained and deployed independently, (3) serving reduces to an embedding lookup plus a dot product, which keeps latency low.
New users and items have no embedding history. Address cold start with: (1) Default embeddings trained to work reasonably on average, (2) Feature-based embeddings using available attributes (demographics, item metadata), (3) Content-based fallbacks that don't require learned embeddings, (4) Exploration strategies that collect signal quickly.
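A minimal two-tower sketch in PyTorch is shown below; the layer widths, the 64-dimensional output, and the L2 normalization are illustrative choices. Because the item tower depends only on item features, its embeddings can be precomputed offline and indexed for retrieval.

```python
# A minimal two-tower sketch in PyTorch (layer sizes are illustrative assumptions).
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x):
        # L2-normalize so the dot product behaves like cosine similarity.
        return nn.functional.normalize(self.net(x), dim=-1)

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, emb_dim=64):
        super().__init__()
        self.user_tower = Tower(user_dim, emb_dim)
        self.item_tower = Tower(item_dim, emb_dim)

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)   # (batch, 64) user embeddings
        v = self.item_tower(item_feats)   # (batch, 64) item embeddings
        return (u * v).sum(dim=-1)        # dot-product score per pair
```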
In production systems, model accuracy is never evaluated in isolation—it's always balanced against latency, cost, and operational constraints. The best model on the validation set is often not the best model for production.
The Pareto Frontier:
For any system, there's a curve of optimal tradeoffs between accuracy and latency:
Accuracy
   ▲
   │      ✕ (infeasible - no model achieves this)
   │                      ○───○───○
   │               ○───○─╱           Pareto Frontier
   │          ○───╱
   │     ○───╱
   │          ✓ (feasible but not optimal - dominated by frontier points)
   │
   └───────────────────────────────▶ Latency
Points on the frontier represent non-dominated tradeoffs—you can't improve accuracy without sacrificing latency or vice versa. Your job is to find the frontier and choose where on it to operate.
Latency Optimization Techniques:
| Technique | Speedup | Accuracy Impact | Implementation Effort | Best For |
|---|---|---|---|---|
| Distillation | 2-10x | Low-Medium | Medium | Deploying large models to resource-constrained environments |
| Quantization | 2-4x | Low | Low | Already-trained models needing faster inference |
| Pruning | 2-10x | Low-Medium | Medium | Sparse patterns in weights, structured architectures |
| Caching | 10-1000x | None | Low | High request overlap, stable embeddings |
| Batching | 2-8x | None | Low | GPU inference, high throughput systems |
| Hardware Accel | 2-20x | None | Medium-High | Production systems with budget for specialized hardware |
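As one example of the caching row above, a minimal prediction cache might look like this; the TTL, the cache key, and the `model.predict` interface are assumptions.

```python
# A minimal prediction-cache sketch: memoize scores for repeated (user, item) requests.
import time

class PredictionCache:
    def __init__(self, model, ttl_seconds=300):
        self.model = model
        self.ttl = ttl_seconds
        self._store = {}

    def score(self, user_id, item_id, features):
        key = (user_id, item_id)
        hit = self._store.get(key)
        if hit is not None and time.time() - hit[1] < self.ttl:
            return hit[0]                       # cache hit: skip inference entirely
        value = self.model.predict(features)    # cache miss: run the model
        self._store[key] = (value, time.time())
        return value
```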
Latency Budgets in Practice:
| Use Case | Total Latency Budget | Model Budget | Implications |
|---|---|---|---|
| Search autocomplete | 50ms | 20ms | Must use light models, aggressive caching |
| Product recommendation | 100ms | 50ms | Can use moderate complexity |
| Content moderation | 500ms | 200ms | Can use deeper models |
| Batch scoring | Minutes | Flexible | Can use heavyweight models |
| Real-time fraud | 50ms | 30ms | Critical path; optimize aggressively |
The Hidden Latency Sources:
Model inference is often not the latency bottleneck. Consider: feature retrieval (feature store lookups, database reads), network hops between services, serialization and deserialization of requests, queueing under load, and pre- and post-processing.
Measure end-to-end latency, not just model inference time.
User experience is determined by tail latencies (P95, P99), not median latencies. A system with P50=20ms but P99=500ms feels slow. Optimize for tail latency, which often requires reducing variance through techniques like request hedging, timeouts, and graceful degradation.
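A quick way to check this from logged timings, using NumPy percentiles; `latencies_ms` is assumed to hold end-to-end request timings, not just model inference time.

```python
# Tail-latency report from logged end-to-end request timings.
import numpy as np

def tail_latency_report(latencies_ms):
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99}
```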
Interpretability isn't an add-on feature—it's an architectural choice. The level of interpretability required should influence model selection from the beginning, not be retrofitted onto a deployed system.
The Interpretability Spectrum:
| Level | What It Provides | Model Examples | Tradeoffs |
|---|---|---|---|
| Fully Transparent | Complete decision logic visible | Linear models, decision trees, rule lists | Limited expressiveness |
| Feature Attribution | Importance of each input feature | SHAP, LIME, attention weights | Explanations may be unstable or misleading |
| Example-Based | Similar training examples | Prototype networks, k-NN components | Requires storing/indexing training data |
| Concept-Based | High-level concept contributions | Concept bottleneck models, CAVs | Requires concept annotation |
| Counterfactual | What would change the prediction | Counterfactual generators | Computationally expensive |
| Limited/None | Model is a black box | Deep neural networks, ensembles | Maximum expressiveness |
Designing for Interpretability:
Pattern 1: Interpretable First Stage
Use an interpretable model as the primary decision-maker, with a complex model as a secondary signal:
Input ─┬─→ [Interpretable Model (Logistic)] → Primary Score ──┐
       │                                                      ├─→ Final
       └─→ [Complex Model (Neural)]         → Adjustment ─────┘

Explanation comes from the interpretable model; the complex model provides marginal improvement without hiding reasoning.
Pattern 2: Attention as Explanation
Design neural architectures where attention weights provide natural explanations:
Text: "This product arrived broken and customer service was unhelpful"
──────────────────────────────────────────────────────────────
Attention: 0.1 0.8 0.1 0.3 0.1 0.2 0.1 0.7
(high) (high)
Prediction: Negative review
Explanation: High attention on "broken" and "unhelpful"
Caution: Attention weights don't always reflect true importance. Use with appropriate skepticism.
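One simple way to surface attention as an explanation, sketched with the toy weights above; the `top_k` cutoff is an arbitrary choice.

```python
# Surface the tokens carrying the most attention mass as a lightweight explanation.
def attention_explanation(tokens, attention_weights, top_k=2):
    ranked = sorted(zip(tokens, attention_weights), key=lambda t: -t[1])
    return [tok for tok, w in ranked[:top_k]]

# e.g. attention_explanation(
#     ["This", "product", "arrived", "broken", "and",
#      "customer", "service", "was", "unhelpful"],
#     [0.1, 0.1, 0.1, 0.8, 0.1, 0.2, 0.3, 0.1, 0.7],
# )  ->  ["broken", "unhelpful"]
```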
Pattern 3: Modular Architecture with Inspectable Components
Raw Features → [Feature Transform Module] → Standardized Features
│ (inspectable)
▼
→ [Feature Interaction Module] → Crossed Features
│ (inspectable)
▼
→ [Prediction Module] → Score
│
└→ Feature Importance (from gradients/SHAP)
Each module can be inspected independently, and feature importance can be computed at multiple levels.
Generating explanations has computational cost. For real-time systems, you might: (1) Generate explanations only on-demand, not for every prediction, (2) Use fast approximation methods, (3) Pre-compute explanations for common cases, (4) Cache explanation templates and fill in values.
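A sketch of the on-demand approach: predictions stay cheap, and the expensive explanation runs only when requested and is cached afterward. The `explain_fn` stands in for whatever attribution method you use (SHAP, LIME, or similar).

```python
# On-demand explanations with caching: serve predictions fast, explain lazily.
class ExplainableService:
    def __init__(self, model, explain_fn):
        self.model = model
        self.explain_fn = explain_fn
        self._explanation_cache = {}

    def predict(self, request_id, features):
        return self.model.predict(features)      # fast path: no explanation cost

    def explain(self, request_id, features):
        if request_id not in self._explanation_cache:
            self._explanation_cache[request_id] = self.explain_fn(self.model, features)
        return self._explanation_cache[request_id]
```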
Production ML systems are never static. Models are continuously retrained, experimented upon, and improved. The architecture must support safe experimentation and controlled rollout without disrupting production traffic.
The Model Registry Pattern:
┌─────────────────────────────────────────────────────────────────┐
│ Model Registry │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Model: churn_predictor │ │
│ │ ├─ v1.0.0 (archived) │ │
│ │ ├─ v1.1.0 (production) ←─── Currently serving 95% │ │
│ │ ├─ v1.2.0 (canary) ←─── Serving 5% for validation │ │
│ │ └─ v2.0.0 (staging) ←─── Pending promotion │ │
│ │ │ │
│ │ Metadata per version: │ │
│ │ - Training data version │ │
│ │ - Hyperparameters │ │
│ │ - Evaluation metrics │ │
│ │ - Feature dependencies │ │
│ │ - Artifact locations │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Model Lifecycle States:
| State | Description | Traffic | Actions |
|---|---|---|---|
| Development | Active training and evaluation | None | Training, offline eval |
| Staging | Passed offline eval, pending online | Shadow/0% | Shadow testing |
| Canary | Initial production exposure | 1-5% | Online metrics collection |
| Production | Primary serving model | 50-99% | Monitoring |
| Deprecated | Scheduled for removal | Decreasing | Migration support |
| Archived | No longer serving | None | Regulatory retention |
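A registry record covering the metadata and lifecycle states above might be modeled roughly like this; the field names are illustrative, and real registries (e.g., MLflow's) differ in detail.

```python
# A sketch of a model-registry record with the lifecycle states from the table above.
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    CANARY = "canary"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"

@dataclass
class ModelVersion:
    name: str                      # e.g. "churn_predictor"
    version: str                   # e.g. "1.2.0"
    stage: Stage
    training_data_version: str
    hyperparameters: dict
    evaluation_metrics: dict
    feature_dependencies: list
    artifact_uri: str
    traffic_fraction: float = 0.0  # e.g. 0.05 for a canary
```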
A/B Testing Architecture:
Online experiments require infrastructure to route traffic and measure outcomes:
Request → [Experiment Assignment] → experiment_id, variant_id
│
┌──────────────┼──────────────┐
▼ ▼ ▼
[Model A] [Model B] [Model C]
(control) (treatment 1) (treatment 2)
│ │ │
└──────────────┼──────────────┘
│
▼
[Prediction + Logging]
│
▼
[Outcome Tracking]
│
▼
[Statistical Analysis]
Experiment Assignment Requirements: assignment must be deterministic (the same user always sees the same variant), consistent across sessions and devices, independent across concurrent experiments, and logged alongside predictions and outcomes so results can be analyzed later. A minimal hash-based assignment is sketched below.
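The hash scheme and the even split below are illustrative choices that satisfy the determinism and independence requirements.

```python
# Deterministic experiment assignment: hash (experiment_id, user_id) into a bucket
# so the same user always sees the same variant, independently per experiment.
import hashlib

def assign_variant(user_id: str, experiment_id: str, variants=("control", "treatment")):
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # stable bucket in [0, 100)
    return variants[bucket * len(variants) // 100]   # even split across variants
```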
Common A/B Testing Pitfalls: stopping experiments early on noisy results ("peeking"), running many variants or metrics without correcting for multiple comparisons, contamination between variants through shared caches or shared state, and mistaking short-lived novelty effects for durable improvements.
Before any live traffic, run new models in shadow mode: they receive real requests and generate predictions, but the predictions aren't used. This validates that: (1) the model runs successfully on production data, (2) latency is acceptable, (3) predictions are reasonable. Shadow mode retires operational risk before the model takes on business risk.
Production models fail. The question is not whether they will fail, but how they will fail and what happens when they do. Designing for resilience means anticipating failure modes and building in graceful degradation.
Common Model Failure Modes: invalid or malformed inputs, dependency timeouts, runtime errors in the model server, silent quality degradation from data drift, and capacity exhaustion under traffic spikes. The detection and response table later in this section addresses each.
Resilience Patterns:
Pattern 1: Fallback Cascade
Request → [Primary Model]
│
Success? ├──Yes──→ Return prediction
│
No
▼
[Fallback Model (simpler)]
│
Success? ├──Yes──→ Return prediction
│
No
▼
[Rule-based Default]
│
└──→ Return default prediction
Each fallback is simpler and more reliable but less accurate. The system degrades gracefully rather than failing completely.
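A sketch of the cascade in code; the model interfaces and the static default of 0.0 are assumptions, and a real system would also log which layer served each request.

```python
# Fallback cascade: each layer is simpler and more reliable than the one above it.
def predict_with_fallback(features, primary_model, fallback_model, default=0.0):
    for model in (primary_model, fallback_model):
        try:
            return model.predict(features)
        except Exception:
            continue                 # log the failure and try the next layer
    return default                   # rule-based/static default as the last resort
```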
Pattern 2: Circuit Breaker
┌─────────────────────────────────────┐
Request → │ Circuit Breaker State Machine │
│ │
│ CLOSED ──(failures)──→ OPEN │
│ ↑ │ │
│ └──(success)─ HALF-OPEN ←(timeout)│
└─────────────────────────────────────┘
CLOSED: Normal operation, requests pass through
OPEN: Skip model entirely, return fallback
HALF-OPEN: Test if model recovered, transition back if success
Prevent cascade failures by failing fast when model is unhealthy.
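A minimal circuit-breaker sketch following the state machine above; the failure threshold and reset timeout are illustrative.

```python
# Circuit breaker: CLOSED -> (failures) -> OPEN -> (timeout) -> HALF-OPEN -> CLOSED.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None                    # None means the circuit is closed

    def call(self, model_fn, fallback_fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback_fn(*args)        # OPEN: skip the model, fail fast
            self.opened_at = None                # HALF-OPEN: let one request probe
        try:
            result = model_fn(*args)
            self.failures = 0                    # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()     # trip back to OPEN
            return fallback_fn(*args)
```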
Pattern 3: Request Hedging
For critical low-latency predictions, send parallel requests:
Request → [Model Instance 1] ─┬─→ Return first response
└→ [Model Instance 2] ─┘ (cancel the other)
Reduces tail latency at cost of compute duplication.
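A hedged sketch using `concurrent.futures` (Python 3.9+ for `cancel_futures`); the replica objects and their `predict` method are assumptions.

```python
# Request hedging: fire the same request at several replicas, return the first answer.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def hedged_predict(features, replicas):
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.predict, features) for r in replicas]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False, cancel_futures=True)   # don't wait on slower replicas
    return result
```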
| Failure Type | Detection | Response | Recovery |
|---|---|---|---|
| Input validation failure | Schema/value checks | Reject or impute | Log and alert for investigation |
| Timeout | Deadline exceeded | Return fallback | Monitor for capacity issues |
| Model error | Exception caught | Circuit breaker + fallback | Automatic or manual rollback |
| Quality degradation | Monitoring alerts | Reduce traffic, investigate | Retrain or rollback |
| Capacity exhaustion | Queue depth, latency spikes | Load shed, auto-scale | Increase capacity |
Regularly inject failures in testing/staging: corrupt inputs, kill model servers, delay responses, simulate traffic spikes. Verify that fallbacks activate correctly and degradation is graceful. You'd rather discover failure modes in controlled testing than in production incidents.
Model architecture for production systems requires balancing many competing concerns—accuracy, latency, interpretability, maintainability, and resilience. The best architecture is not necessarily the most accurate one, but the one that best serves the complete system requirements.
What's next:
With model architecture designed, the next challenge is serving predictions reliably at scale. The next page explores Serving Infrastructure—how to deploy models for production inference, including serving patterns, scaling strategies, and operational considerations.
You now understand how to design model architectures for production ML systems—from selecting model families through multi-stage pipelines, embedding design, latency optimization, interpretability, experimentation, and resilience. Next, we deploy these models to production serving infrastructure.