A model sitting in a notebook produces no value. Value is created when that model serves predictions to users, applications, and downstream systems—reliably, quickly, and at scale. This is the domain of serving infrastructure.
Serving infrastructure is where ML meets distributed systems engineering. The challenges are familiar to anyone who has built scalable web services—load balancing, caching, failure handling, resource management—but ML adds unique wrinkles: large model sizes, GPU acceleration, dynamic batching, and the constant evolution of models through retraining and experimentation.
This page covers the principles and practices for building serving infrastructure that transforms trained models into production prediction services.
By the end of this page, you will understand how to deploy and operate ML models in production—from serving patterns and scaling strategies to hardware selection and operational best practices. You'll learn to design serving infrastructure that meets latency, throughput, and availability requirements.
Different use cases require different serving patterns. The right pattern depends on latency requirements, traffic patterns, and integration constraints.
The Serving Pattern Spectrum:
| Pattern | Latency | When to Update | Use Cases | Complexity |
|---|---|---|---|---|
| Batch Prediction | Hours | On schedule | Offline analytics, daily recommendations | Low |
| Precomputed | Milliseconds | On schedule/trigger | Limited input space, cacheable results | Low-Medium |
| Online Prediction | Milliseconds | Real-time | User-facing features, real-time decisions | Medium-High |
| Streaming | Seconds-Minutes | Continuous | Event processing, real-time aggregations | High |
| Edge Deployment | Milliseconds | Periodic sync | Mobile, IoT, offline-capable | High |
Pattern 1: Batch Prediction
Generate predictions for all entities on a schedule, store results:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ All Users │ ──→ │ Model │ ──→ │ Predictions │
│ (from DB) │ │ (batch job) │ │ (to DB) │
└─────────────┘ └─────────────┘ └─────────────┘
Runs nightly
At serving time:
Request → Database lookup → Return precomputed prediction
Advantages:
- Simple to operate: inference runs as a scheduled job with no request-latency constraints.
- Efficient use of compute; batch jobs can run on cheap or spot capacity.
- Serving is just a fast database lookup.
Limitations:
- Predictions go stale between runs and cannot react to fresh context.
- Compute is wasted on entities that are never requested.
- New entities have no prediction until the next run.
Best for: Recommendations that can be a few hours old, offline scoring, regulatory reporting.
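A minimal sketch of this pattern, assuming SQLite stands in for the warehouse and prediction store and that `model` exposes a scikit-learn-style `predict`; all table and column names here are hypothetical:
# Nightly batch job (run by cron or an orchestrator)
import sqlite3  # stand-in for the real warehouse / key-value store

def score_all_users(model, db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT user_id, f1, f2, f3 FROM user_features").fetchall()
    preds = model.predict([row[1:] for row in rows])  # one bulk inference pass
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
        [(row[0], float(p)) for row, p in zip(rows, preds)],
    )
    conn.commit()

# At serving time: a single indexed lookup, no model in the request path
def get_prediction(conn, user_id):
    row = conn.execute("SELECT score FROM predictions WHERE user_id = ?", (user_id,)).fetchone()
    return row[0] if row else None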
Pattern 2: Precomputed (Cache-Based)
Similar to batch but with finer-grained caching:
Request ─┬→ Cache Hit? ──Yes──→ Return cached prediction
│
└→ No ──→ [Model] ──→ Cache + Return
Best for: High-traffic scenarios with significant request overlap, e.g., popular item recommendations.
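A minimal cache-aside sketch of this flow, using an in-process dict with a TTL as a stand-in for a real cache such as Redis; `model` and the request key are assumptions:
import time

CACHE = {}                 # request_key -> (prediction, expiry_timestamp)
CACHE_TTL_SECONDS = 300

def cached_predict(model, request_key, features):
    entry = CACHE.get(request_key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit: skip inference
    prediction = model.predict([features])[0]    # cache miss: run the model
    CACHE[request_key] = (prediction, time.time() + CACHE_TTL_SECONDS)
    return prediction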
Pattern 3: Online Prediction (Synchronous)
Compute predictions on-demand for each request:
Request → [Feature Fetch] → [Model Inference] → Response
~10-50ms ~10-100ms
Advantages:
- Predictions reflect the latest features and request context.
- Works for any input, including entities never seen before.
Limitations:
- Every request pays feature-fetch and inference latency.
- Requires always-on, scalable serving infrastructure.
Best for: Real-time personalization, fraud detection, dynamic pricing.
Pattern 4: Streaming Predictions
Process events continuously as they arrive:
Event Stream → [Stream Processor] → [Model] → Prediction Stream
(Flink, Spark) (embedded) (Kafka, etc.)
Best for: Near-real-time analytics, continuous monitoring, event-driven predictions.
Pattern 5: Edge Deployment
Run models directly on client devices:
[Mobile App] ←──model sync──→ [Model Server]
│
▼
[On-Device Model] → Prediction (local)
Advantages:
- Very low latency; no network round trip.
- Works offline and keeps data on the device (privacy-friendly).
Limitations:
- Constrained by device compute, memory, and battery.
- Model updates reach devices slowly, and fleet versions fragment.
Best for: Keyboard predictions, on-device image classification, privacy-sensitive applications.
Production systems often combine patterns. Example: Precompute predictions for the top 1000 users (batch), cache predictions for the next tier (precomputed), and fall back to real-time for long-tail users (online). This optimizes cost and latency across the traffic distribution.
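One way that tiered fallback could look in code, as a sketch; `precomputed_store`, `cache`, and `online_model` are hypothetical stand-ins for the batch table, the cache layer, and the real-time model service:
def get_recommendation(user_id, features, precomputed_store, cache, online_model):
    pred = precomputed_store.get(user_id)        # tier 1: batch-precomputed top users
    if pred is not None:
        return pred
    pred = cache.get(user_id)                    # tier 2: cached recent predictions
    if pred is not None:
        return pred
    pred = online_model.predict([features])[0]   # tier 3: long-tail real-time fallback
    cache.set(user_id, pred)
    return pred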
For online prediction systems, the serving architecture determines how requests flow from clients to model inference and back. A well-designed architecture handles scale, failure, and evolution gracefully.
Standard Serving Architecture:
┌───────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Route, rate limit, TLS) │
└───────────────────────────────┬───────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Gateway │ │ Gateway │ │ Gateway │
│ (API, Auth) │ │ (API, Auth) │ │ (API, Auth) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Feature Service │
│ (Fetch features from store) │
└───────────────────────────────┬───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Model Service │
│ (Run inference, return predictions) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Model A │ │Model A │ │Model B │ │Model B │ ... │
│ │Replica 1│ │Replica 2│ │Replica 1│ │Replica 2│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────────────────────────────────────────────────────┘
Embedded vs. Standalone Model Servers:
Embedded: Model runs within the application process.
# Application code
from my_model import predict
def handle_request(request):
features = extract_features(request)
prediction = predict(features) # Model runs in same process
return prediction
Standalone: Model runs in a separate service.
Application → HTTP/gRPC → Model Server → Prediction
| Aspect | Embedded | Standalone |
|---|---|---|
| Latency | Lower (no network hop) | Higher (network overhead) |
| Resource sharing | Model competes with app | Dedicated resources |
| Scaling | Scale app and model together | Scale independently |
| Deployment | Redeploy app for model changes | Update model server only |
| Technology | Limited to app's runtime | Can use specialized serving |
Recommendation: Use standalone model servers for:
- GPU-accelerated or large models that need dedicated resources
- Models shared by multiple applications or teams
- Workloads where the model and the application must scale independently
Use embedded for:
- Small, CPU-friendly models where the extra network hop matters
- Simple deployments where the application and model ship and scale together
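From the application side, the standalone approach reduces to a remote call. A sketch, assuming the model server exposes a hypothetical HTTP JSON endpoint; the URL, payload shape, and timeout are illustrative:
import requests

MODEL_SERVER_URL = "http://model-server:8080/v1/predict"   # hypothetical endpoint

def remote_predict(features, timeout_seconds=0.2):
    try:
        resp = requests.post(
            MODEL_SERVER_URL,
            json={"features": features},
            timeout=timeout_seconds,     # bound latency if the model server is slow
        )
        resp.raise_for_status()
        return resp.json()["prediction"]
    except requests.RequestException:
        return None                      # degrade gracefully rather than fail the request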
Specialized model serving frameworks handle inference optimization automatically: TensorFlow Serving (TensorFlow models), Triton Inference Server (multi-framework, GPU-optimized), TorchServe (PyTorch), Seldon Core (Kubernetes-native), BentoML (framework-agnostic). These provide batching, GPU management, multi-model serving, and model versioning out of the box.
ML serving systems must scale to handle traffic variability—from baseline load to peak events. Scaling strategy depends on the nature of the workload and the cost of additional capacity.
Horizontal vs. Vertical Scaling:
Horizontal scaling: Add more instances of the model server. Preferred for stateless inference: capacity grows roughly linearly, and a single instance failure removes only a fraction of capacity.
Vertical scaling: Use larger/more powerful instances. Simpler to manage but bounded by the largest available machine; often necessary when a model must fit in a single GPU's memory.
Auto-Scaling Patterns:
Reactive auto-scaling: Scale based on current metrics.
[Metrics: CPU, GPU, latency, queue depth]
│
▼
[Scaling Policy: if avg_latency > 100ms for 2min, add 2 instances]
│
▼
[Instance Pool: 4 → 6 instances]
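The policy above can be pictured as a small control loop. A toy sketch, where get_p99_latency_ms and set_replica_count are hypothetical hooks into your metrics and orchestration systems (in practice, Kubernetes HPA or a cloud auto-scaler plays this role):
import time

def autoscale_loop(get_p99_latency_ms, set_replica_count,
                   min_replicas=2, max_replicas=20,
                   latency_sla_ms=100, check_interval_s=60):
    replicas = min_replicas
    consecutive_breaches = 0
    while True:
        if get_p99_latency_ms() > latency_sla_ms:
            consecutive_breaches += 1
        else:
            consecutive_breaches = 0
        # "latency above SLA for 2 minutes" -> add 2 instances
        if consecutive_breaches >= 2 and replicas < max_replicas:
            replicas = min(replicas + 2, max_replicas)
            set_replica_count(replicas)
            consecutive_breaches = 0
        time.sleep(check_interval_s)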
Predictive auto-scaling: Scale based on predicted future load.
[Historical Traffic] → [Prediction Model] → [Future Load Estimate]
│
▼
[Pre-scale 30 minutes before predicted spike]
Schedule-based scaling: Pre-defined scaling for known patterns.
9:00 AM weekdays: 10 instances (morning traffic spike)
6:00 PM weekdays: 5 instances (evening decline)
Weekends: 3 instances (lower traffic)
Scaling Metrics for ML:
| Metric | What It Indicates | Scaling Response |
|---|---|---|
| Request latency (P99) | Processing capacity | Scale up if above SLA |
| GPU utilization | Compute saturation | Scale up if sustained >80% |
| Request queue depth | Backlog accumulation | Scale up if growing |
| Memory utilization | Model/cache fit | May need larger instances |
| Error rate | System stress | Scale up, investigate errors |
Dynamic Batching:
Batching multiple requests into a single inference call dramatically improves GPU efficiency:
Without batching:
Request 1 → [GPU] → Response (20% GPU utilization)
Request 2 → [GPU] → Response (20% GPU utilization)
...
With batching:
Request 1 ─┐
Request 2 ─┼──→ [GPU: batch of 8] → Responses (80% GPU utilization)
... ─┘
Request 8 ─┘
Batching tradeoffs:
- Larger batches improve GPU utilization and throughput but add queuing latency while the batch fills.
- Latency-sensitive endpoints need a tight cap on how long a request may wait in the queue.
Batching parameters:
- max_batch_size: Maximum requests in a batch
- max_wait_time: Maximum time to wait for the batch to fill
Example configuration:
# Triton Inference Server batching config
dynamic_batching {
max_queue_delay_microseconds: 10000 # 10ms max wait
preferred_batch_size: [4, 8, 16] # Preferred batch sizes
}
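The same idea as an in-process micro-batcher sketch, built around the two parameters above; the queue-based interface and item shape are assumptions, not any specific framework's API:
import queue
import time

def batching_worker(model, request_queue, max_batch_size=8, max_wait_time_s=0.01):
    # Each queued item is a dict: {"features": [...], "result": queue.Queue()}
    while True:
        batch = [request_queue.get()]                  # block until one request arrives
        deadline = time.monotonic() + max_wait_time_s
        while len(batch) < max_batch_size:             # fill until full or wait budget spent
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        predictions = model.predict([r["features"] for r in batch])   # one batched call
        for req, pred in zip(batch, predictions):
            req["result"].put(pred)                    # return each result to its caller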
Scaling up introduces cold start latency—time to provision instances, load models, and warm caches. For large models (GB+), loading can take minutes. Strategies: keep minimum replica count, use model preloading, maintain warm standby instances, use smaller model variants for fast scaling.
Model inference has specific hardware requirements that differ from general web services. Understanding these requirements enables cost-effective infrastructure selection.
CPU vs. GPU vs. Specialized Accelerators:
| Hardware | Best For | Latency | Throughput | Cost |
|---|---|---|---|---|
| CPU (general) | Small models, tree-based, sparse operations | Low-Medium | Medium | Low |
| CPU (AVX-512) | Quantized models, optimized inference | Low | Medium-High | Low-Medium |
| GPU (NVIDIA T4) | Medium neural nets, mixed precision | Low | High | Medium |
| GPU (NVIDIA A10G) | Large neural nets, transformer models | Low | Very High | High |
| GPU (NVIDIA A100) | Largest models, maximum throughput | Very Low | Extreme | Very High |
| AWS Inferentia | Optimized inference, cost-effective | Low | High | Medium-Low |
| TPU (Google) | TensorFlow models, very large scale | Low | Extreme | High |
Hardware Selection Decision Tree:
Is the model a neural network?
├── No (tree-based, linear)
│ └── Use CPU (optimized libraries like XGBoost, ONNX Runtime)
│
└── Yes (neural network)
│
└── What's the model size?
├── Small (<100MB)
│ └── CPU may suffice; GPU if high throughput needed
│
├── Medium (100MB-1GB)
│ └── GPU recommended; consider T4 or Inferentia
│
└── Large (>1GB)
└── GPU required; A10G, A100, or multi-GPU
Memory Considerations:
Models must fit in memory (RAM for CPU, VRAM for GPU):
| Model Type | Typical Size | Memory Requirement |
|---|---|---|
| Logistic Regression | KB-MB | Minimal |
| Gradient Boosting | MB-GB | Moderate |
| Small Neural Net | 10-100MB | Moderate |
| BERT-base | ~500MB | High |
| Large Language Model | 10-100GB+ | Very High (multi-GPU) |
Memory optimization techniques:
- Quantization (e.g., FP32 → INT8) shrinks weights roughly 4x.
- Pruning and distillation produce smaller models with similar accuracy.
- Sharding across multiple GPUs when a model exceeds a single device's VRAM.
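A rough back-of-the-envelope check of the sizes above: weight memory is approximately parameter count times bytes per parameter (activations and framework overhead add more on top):
def model_memory_gb(num_parameters, bytes_per_param=4):   # 4 bytes = FP32
    return num_parameters * bytes_per_param / 1e9

print(model_memory_gb(110e6))       # BERT-base, ~110M params -> ~0.44 GB in FP32
print(model_memory_gb(110e6, 1))    # same model, INT8-quantized -> ~0.11 GB
print(model_memory_gb(70e9, 2))     # 70B-parameter LLM in FP16 -> ~140 GB (multi-GPU)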
Cost Optimization:
Spot/Preemptible instances: For batch inference, use spot instances at 60-90% discount. Requires handling interruptions gracefully.
Right-sizing: Match instance type to model needs. Overprovisioned GPUs waste money; underprovisioned CPUs add latency.
Inference-optimized instances: AWS Inf1 (Inferentia), GCP TPU inference, Azure ML instances are often more cost-effective than general-purpose GPUs.
Cost comparison example:
| Configuration | Hourly Cost | Throughput (req/sec) | Cost per 1M requests |
|---|---|---|---|
| 8x c5.large (CPU) | $0.68 | 100 | $1.89 |
| 2x g4dn.xlarge (T4 GPU) | $1.05 | 500 | $0.58 |
| 1x inf1.xlarge (Inferentia) | $0.23 | 300 | $0.21 |
Costs are illustrative; actual costs vary by model and workload.
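The cost-per-1M-requests column follows directly from hourly cost and sustained throughput; a quick sketch of the arithmetic:
def cost_per_million_requests(hourly_cost_usd, throughput_req_per_sec):
    requests_per_hour = throughput_req_per_sec * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000

print(cost_per_million_requests(0.68, 100))   # CPU fleet   -> ~$1.89
print(cost_per_million_requests(1.05, 500))   # T4 GPUs     -> ~$0.58
print(cost_per_million_requests(0.23, 300))   # Inferentia  -> ~$0.21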
Always profile model inference on candidate hardware before committing. Theoretical FLOPS don't account for memory bandwidth, data transfer overhead, or framework efficiency. A model that's 2x faster on paper might only be 1.2x faster in practice.
Deploying a new model version to production carries risk. A regression in model quality or a bug in serving code can have immediate business impact. Deployment strategies manage this risk through controlled rollout.
Deployment Strategy Comparison:
| Strategy | Risk Level | Rollback Speed | Best For |
|---|---|---|---|
| Shadow Deployment | None | N/A | Initial validation before any traffic |
| Canary Deployment | Low | Fast (seconds) | Standard model updates |
| Blue-Green Deployment | Medium | Fast (seconds) | Major version changes |
| Rolling Deployment | Medium | Slow (minutes) | Large fleets, stateless models |
| Feature Flag Deployment | Configurable | Instant | Rapid experimentation |
Shadow Deployment:
New model receives real traffic but predictions aren't used:
Request → [Current Model] → Production Response
│
└──→ [New Model] → Log only (compare offline)
Benefits:
- Zero user-facing risk; production responses still come from the current model.
- Validates the new model against real traffic and real feature values before it serves anyone.
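A sketch of request mirroring for shadow deployment, assuming both models expose the same predict interface; the fire-and-forget thread and logger are illustrative details:
import logging
import threading

logger = logging.getLogger("shadow")

def serve_with_shadow(production_model, shadow_model, features):
    prod_prediction = production_model.predict([features])[0]   # returned to the user

    def run_shadow():
        try:
            shadow_prediction = shadow_model.predict([features])[0]
            # Log only; shadow output never reaches users
            logger.info("shadow_compare prod=%s shadow=%s",
                        prod_prediction, shadow_prediction)
        except Exception:
            logger.exception("shadow model failed")

    threading.Thread(target=run_shadow, daemon=True).start()
    return prod_prediction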
Canary Deployment:
New model receives a small percentage of traffic:
Request → [Router: 95% / 5%]
│ │
▼ ▼
[Model v1.0] [Model v1.1]
(production) (canary)
Progression:
- Start at a small slice (e.g., 1-5%) and compare quality and latency metrics against the production model.
- Increase gradually (e.g., 5% → 25% → 50% → 100%), pausing at each step; roll back immediately if metrics regress.
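A deterministic traffic splitter sketch for the canary routing above, hashing on user ID so each user consistently hits the same version; the hash choice and model names are illustrative:
import hashlib

def route_model(user_id, canary_percent=5):
    # Stable hash -> the same user always lands in the same bucket
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return "model-v1.1-canary" if bucket < canary_percent else "model-v1.0"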
Blue-Green Deployment:
Maintain two complete environments, switch traffic atomically:
Before: Traffic → [Blue: v1.0] ← Active
[Green: v1.1] ← Standby (new version)
Switch: Traffic → [Blue: v1.0] ← Standby
[Green: v1.1] ← Active
Benefits:
- The new version can be fully validated in the green environment before receiving any traffic.
- Rollback is a single routing switch back to blue.
Costs:
- Two complete environments double infrastructure cost during the transition.
Deployment Validation Checklist:
Before promoting any model version:
Offline validation passed: evaluation metrics meet or exceed the current production model on held-out data.
Shadow/staging validation passed: prediction distributions, error rates, and latency look healthy against real traffic.
Canary validation passed: business and system metrics remain within thresholds at the canary traffic percentage.
Rollback plan ready: the previous version remains deployable, and the rollback trigger and owner are agreed upon.
Never deploy model changes before weekends, holidays, or when on-call coverage is reduced. Model issues can take time to surface, and you want full team availability to detect and respond. If you must deploy, use extra-conservative canary percentages and have a clear rollback trigger.
Production serving systems must be reliable—available when needed, returning correct results, and degrading gracefully when issues occur.
Availability Targets:
| Target | Downtime per Year | Use Case |
|---|---|---|
| 99% | 3.65 days | Development/staging |
| 99.9% | 8.76 hours | Standard production |
| 99.95% | 4.38 hours | Business-critical |
| 99.99% | 52.6 minutes | Revenue-critical |
| 99.999% | 5.26 minutes | Life/safety-critical |
Each additional "9" requires significantly more investment in redundancy, monitoring, and incident response.
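The downtime figures in the table are simple arithmetic on the availability target:
def downtime_minutes_per_year(availability_percent):
    return (1 - availability_percent / 100) * 365 * 24 * 60

print(downtime_minutes_per_year(99.9) / 60)   # ~8.76 hours
print(downtime_minutes_per_year(99.99))       # ~52.6 minutes
print(downtime_minutes_per_year(99.999))      # ~5.26 minutes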
Health Check Design:
Shallow vs. deep health checks:
# Shallow: Just checks if process is running
# (Flask-style handlers; assumes app = Flask(__name__), model, feature_store,
#  and a representative CANARY_INPUT are defined elsewhere)
@app.route('/health/live')
def liveness():
return {'status': 'alive'}
# Deep: Actually runs inference
@app.route('/health/ready')
def readiness():
try:
# Run a lightweight inference
test_prediction = model.predict(CANARY_INPUT)
# Validate prediction is sensible
assert 0 <= test_prediction <= 1
# Check dependencies
feature_store_healthy = feature_store.ping()
return {
'status': 'ready',
'model_version': model.version,
'feature_store': feature_store_healthy
}
except Exception as e:
return {'status': 'not_ready', 'error': str(e)}, 503
Use liveness checks so the orchestrator (e.g., Kubernetes) restarts a stuck process. Use readiness checks for load balancer routing, so unhealthy instances are excluded from rotation.
Failure Domain Isolation:
Isolate failures to prevent cascade:
┌─────────────────────────────────────────────────────────────────┐
│ Region: us-east-1 │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Zone A │ │ Zone B │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │Model Service │ │ │ │Model Service │ │ │
│ │ │Replicas: 3 │ │ │ │Replicas: 3 │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │Feature Store │ │ │ │Feature Store │ │ │
│ │ │(Read Replica) │ │ │ │(Read Replica) │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Zone failure affects only that zone; other zones continue serving.
Error budgets make reliability vs. velocity tradeoffs explicit. If your SLA is 99.9% availability, you have 0.1% error budget per month (~43 minutes). Spend it on deploys and experiments; when budget is exhausted, freeze changes and focus on stability. This aligns engineering incentives with reliability goals.
Running ML serving in production requires ongoing operational attention. Beyond the initial deployment, teams must handle incidents, perform maintenance, and continuously improve.
Essential Operational Capabilities:
| Category | Example Scenarios | Required Capabilities |
|---|---|---|
| Incident Response | Model latency spike, error rate increase | Alert routing, runbooks, rollback procedures |
| Capacity Management | Traffic growth, seasonal spikes | Capacity planning, auto-scaling, load testing |
| Change Management | Model updates, infrastructure changes | Deployment automation, approval workflows |
| Maintenance | Dependency updates, infrastructure patching | Maintenance windows, zero-downtime updates |
| Disaster Recovery | Region failure, data corruption | Backup/restore procedures, DR drills |
On-Call Best Practices:
Runbooks for common issues: Document steps for frequent alerts. On-call shouldn't require deep investigation at 3 AM.
Escalation paths: Clear escalation from on-call → team lead → domain expert → management.
Blameless postmortems: Focus on system improvements, not individual blame. What can we change to prevent recurrence?
Alert hygiene: Every alert should be actionable. If alerts are frequently noise, they get ignored.
Rotation and coverage: Fair on-call rotation with adequate rest between shifts. Cover for holidays and time zones.
Logging and Debugging:
Production debugging requires comprehensive logging:
import time
import logging

logger = logging.getLogger(__name__)

# Assumes model, feature_store, and generate_request_id() are defined elsewhere
def predict(request):
request_id = generate_request_id()
logger.info("Prediction request received", extra={
'request_id': request_id,
'user_id': request.user_id,
'model_version': model.version,
'feature_count': len(request.features),
})
start_time = time.time()
# Feature retrieval
features = feature_store.get_features(request)
feature_time = time.time() - start_time
# Model inference
prediction = model.predict(features)
inference_time = time.time() - start_time - feature_time
logger.info("Prediction completed", extra={
'request_id': request_id,
'prediction': prediction,
'feature_retrieval_ms': feature_time * 1000,
'inference_ms': inference_time * 1000,
'total_ms': (time.time() - start_time) * 1000,
})
return prediction
Log query examples:
- Trace a single request_id end to end across gateway, feature service, and model service.
- Find slow requests (total_ms above the SLA) and attribute latency to feature_retrieval_ms vs. inference_ms.
- Count predictions per model_version to confirm a rollout or rollback took effect.
Cost Management:
ML serving can be expensive. Ongoing cost optimization is essential:
Right-size instances: Monitor actual utilization; downsize over-provisioned resources.
Use spot/preemptible for batch: Batch prediction doesn't need on-demand reliability.
Optimize model efficiency: Quantization, pruning, and distillation reduce inference cost.
Reduce unnecessary predictions: Cache where possible; batch where possible.
Multi-tenancy: Share infrastructure across models and teams where feasible.
Cost visibility: Attribute serving spend to individual models, endpoints, and teams so optimization effort targets the biggest line items.
Example cost dashboard metrics:
- Cost per 1M predictions, broken down by model and endpoint
- GPU/CPU utilization versus provisioned capacity
- Month-over-month spend trend by team
Building and operating ML serving infrastructure is substantial ongoing effort. Managed services (AWS SageMaker, GCP Vertex AI, Azure ML) handle much of this complexity at premium cost. Evaluate build vs. buy based on: team size, expertise, scale requirements, customization needs, and long-term cost projections.
Serving infrastructure transforms trained models into production prediction services. The challenges span distributed systems, hardware optimization, deployment safety, and operational excellence. Success requires combining ML understanding with systems engineering expertise.
What's next:
Deploying models is not the end—it's the beginning of ongoing operation. The next page explores Monitoring and Maintenance—how to detect model degradation, handle drift, manage retraining, and ensure your ML system continues to perform over time.
You now understand how to design and operate serving infrastructure for production ML systems—from serving patterns and architecture through scaling, deployment, reliability, and operations. Next, we ensure these systems continue to perform through monitoring and maintenance.