A model sitting in a notebook produces no value. Value is created when that model serves predictions to users, applications, and downstream systems—reliably, quickly, and at scale. This is the domain of serving infrastructure.
Serving infrastructure is where ML meets distributed systems engineering. The challenges are familiar to anyone who has built scalable web services—load balancing, caching, failure handling, resource management—but ML adds unique wrinkles: large model sizes, GPU acceleration, dynamic batching, and the constant evolution of models through retraining and experimentation.
This page covers the principles and practices for building serving infrastructure that transforms trained models into production prediction services.
By the end of this page, you will understand how to deploy and operate ML models in production—from serving patterns and scaling strategies to hardware selection and operational best practices. You'll learn to design serving infrastructure that meets latency, throughput, and availability requirements.
Different use cases require different serving patterns. The right pattern depends on latency requirements, traffic patterns, and integration constraints.
The Serving Pattern Spectrum:
| Pattern | Latency | When to Update | Use Cases | Complexity |
|---|---|---|---|---|
| Batch Prediction | Hours | On schedule | Offline analytics, daily recommendations | Low |
| Precomputed | Milliseconds | On schedule/trigger | Limited input space, cacheable results | Low-Medium |
| Online Prediction | Milliseconds | Real-time | User-facing features, real-time decisions | Medium-High |
| Streaming | Seconds-Minutes | Continuous | Event processing, real-time aggregations | High |
| Edge Deployment | Milliseconds | Periodic sync | Mobile, IoT, offline-capable | High |
Pattern 1: Batch Prediction
Generate predictions for all entities on a schedule, store results:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ All Users │ ──→ │ Model │ ──→ │ Predictions │
│ (from DB) │ │ (batch job) │ │ (to DB) │
└─────────────┘ └─────────────┘ └─────────────┘
Runs nightly
At serving time:
Request → Database lookup → Return precomputed prediction
Advantages:
- Simple to operate: inference runs as a scheduled job with no request-latency constraints.
- Efficient use of compute; batch jobs can run on cheap or spot capacity.
- Serving is just a fast database lookup.
Limitations:
- Predictions go stale between runs and cannot react to fresh context.
- Compute is wasted on entities that are never requested.
- New entities have no prediction until the next run.
Best for: Recommendations that can be a few hours old, offline scoring, regulatory reporting.
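A minimal sketch of this pattern, assuming SQLite stands in for the warehouse and prediction store and that `model` exposes a scikit-learn-style `predict`; all table and column names here are hypothetical:
# Nightly batch job (run by cron or an orchestrator)
import sqlite3  # stand-in for the real warehouse / key-value store

def score_all_users(model, db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT user_id, f1, f2, f3 FROM user_features").fetchall()
    preds = model.predict([row[1:] for row in rows])  # one bulk inference pass
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
        [(row[0], float(p)) for row, p in zip(rows, preds)],
    )
    conn.commit()

# At serving time: a single indexed lookup, no model in the request path
def get_prediction(conn, user_id):
    row = conn.execute("SELECT score FROM predictions WHERE user_id = ?", (user_id,)).fetchone()
    return row[0] if row else None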
Pattern 2: Precomputed (Cache-Based)
Similar to batch but with finer-grained caching:
Request ─┬→ Cache Hit? ──Yes──→ Return cached prediction
│
└→ No ──→ [Model] ──→ Cache + Return
Best for: High-traffic scenarios with significant request overlap, e.g., popular item recommendations.
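A minimal cache-aside sketch of this flow, using an in-process dict with a TTL as a stand-in for a real cache such as Redis; `model` and the request key are assumptions:
import time

CACHE = {}                 # request_key -> (prediction, expiry_timestamp)
CACHE_TTL_SECONDS = 300

def cached_predict(model, request_key, features):
    entry = CACHE.get(request_key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit: skip inference
    prediction = model.predict([features])[0]    # cache miss: run the model
    CACHE[request_key] = (prediction, time.time() + CACHE_TTL_SECONDS)
    return prediction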
Pattern 3: Online Prediction (Synchronous)
Compute predictions on-demand for each request:
Request → [Feature Fetch] → [Model Inference] → Response
~10-50ms ~10-100ms
Advantages:
- Predictions reflect the latest features and request context.
- Works for any input, including entities never seen before.
Limitations:
- Every request pays feature-fetch and inference latency.
- Requires always-on, scalable serving infrastructure.
Best for: Real-time personalization, fraud detection, dynamic pricing.
Pattern 4: Streaming Predictions
Process events continuously as they arrive:
Event Stream → [Stream Processor] → [Model] → Prediction Stream
(Flink, Spark) (embedded) (Kafka, etc.)
Best for: Near-real-time analytics, continuous monitoring, event-driven predictions.
Pattern 5: Edge Deployment
Run models directly on client devices:
[Mobile App] ←──model sync──→ [Model Server]
│
▼
[On-Device Model] → Prediction (local)
Advantages:
- Very low latency; no network round trip.
- Works offline and keeps data on the device (privacy-friendly).
Limitations:
- Constrained by device compute, memory, and battery.
- Model updates reach devices slowly, and fleet versions fragment.
Best for: Keyboard predictions, on-device image classification, privacy-sensitive applications.
Production systems often combine patterns. Example: Precompute predictions for the top 1000 users (batch), cache predictions for the next tier (precomputed), and fall back to real-time for long-tail users (online). This optimizes cost and latency across the traffic distribution.
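One way that tiered fallback could look in code, as a sketch; `precomputed_store`, `cache`, and `online_model` are hypothetical stand-ins for the batch table, the cache layer, and the real-time model service:
def get_recommendation(user_id, features, precomputed_store, cache, online_model):
    pred = precomputed_store.get(user_id)        # tier 1: batch-precomputed top users
    if pred is not None:
        return pred
    pred = cache.get(user_id)                    # tier 2: cached recent predictions
    if pred is not None:
        return pred
    pred = online_model.predict([features])[0]   # tier 3: long-tail real-time fallback
    cache.set(user_id, pred)
    return pred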
For online prediction systems, the serving architecture determines how requests flow from clients to model inference and back. A well-designed architecture handles scale, failure, and evolution gracefully.
Standard Serving Architecture:
┌───────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Route, rate limit, TLS) │
└───────────────────────────────┬───────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Gateway │ │ Gateway │ │ Gateway │
│ (API, Auth) │ │ (API, Auth) │ │ (API, Auth) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Feature Service │
│ (Fetch features from store) │
└───────────────────────────────┬───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Model Service │
│ (Run inference, return predictions) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Model A │ │Model A │ │Model B │ │Model B │ ... │
│ │Replica 1│ │Replica 2│ │Replica 1│ │Replica 2│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────────────────────────────────────────────────────┘
Embedded vs. Standalone Model Servers:
Embedded: Model runs within the application process.
# Application code
from my_model import predict
def handle_request(request):
features = extract_features(request)
prediction = predict(features) # Model runs in same process
return prediction
Standalone: Model runs in a separate service.
Application → HTTP/gRPC → Model Server → Prediction
| Aspect | Embedded | Standalone |
|---|---|---|
| Latency | Lower (no network hop) | Higher (network overhead) |
| Resource sharing | Model competes with app | Dedicated resources |
| Scaling | Scale app and model together | Scale independently |
| Deployment | Redeploy app for model changes | Update model server only |
| Technology | Limited to app's runtime | Can use specialized serving |
Recommendation: Use standalone model servers for:
- GPU-accelerated or large models that need dedicated resources
- Models shared by multiple applications or teams
- Workloads where the model and the application must scale independently
Use embedded for:
- Small, CPU-friendly models where the extra network hop matters
- Simple deployments where the application and model ship and scale together
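From the application side, the standalone approach reduces to a remote call. A sketch, assuming the model server exposes a hypothetical HTTP JSON endpoint; the URL, payload shape, and timeout are illustrative:
import requests

MODEL_SERVER_URL = "http://model-server:8080/v1/predict"   # hypothetical endpoint

def remote_predict(features, timeout_seconds=0.2):
    try:
        resp = requests.post(
            MODEL_SERVER_URL,
            json={"features": features},
            timeout=timeout_seconds,     # bound latency if the model server is slow
        )
        resp.raise_for_status()
        return resp.json()["prediction"]
    except requests.RequestException:
        return None                      # degrade gracefully rather than fail the request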
Specialized model serving frameworks handle inference optimization automatically: TensorFlow Serving (TensorFlow models), Triton Inference Server (multi-framework, GPU-optimized), TorchServe (PyTorch), Seldon Core (Kubernetes-native), BentoML (framework-agnostic). These provide batching, GPU management, multi-model serving, and model versioning out of the box.
ML serving systems must scale to handle traffic variability—from baseline load to peak events. Scaling strategy depends on the nature of the workload and the cost of additional capacity.
Horizontal vs. Vertical Scaling:
Horizontal scaling: Add more instances of the model server. Preferred for stateless inference: capacity grows roughly linearly, and a single instance failure removes only a fraction of capacity.
Vertical scaling: Use larger/more powerful instances. Simpler to manage but bounded by the largest available machine; often necessary when a model must fit in a single GPU's memory.
Auto-Scaling Patterns:
Reactive auto-scaling: Scale based on current metrics.
[Metrics: CPU, GPU, latency, queue depth]
│
▼
[Scaling Policy: if avg_latency > 100ms for 2min, add 2 instances]
│
▼
[Instance Pool: 4 → 6 instances]
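The policy above can be pictured as a small control loop. A toy sketch, where get_p99_latency_ms and set_replica_count are hypothetical hooks into your metrics and orchestration systems (in practice, Kubernetes HPA or a cloud auto-scaler plays this role):
import time

def autoscale_loop(get_p99_latency_ms, set_replica_count,
                   min_replicas=2, max_replicas=20,
                   latency_sla_ms=100, check_interval_s=60):
    replicas = min_replicas
    consecutive_breaches = 0
    while True:
        if get_p99_latency_ms() > latency_sla_ms:
            consecutive_breaches += 1
        else:
            consecutive_breaches = 0
        # "latency above SLA for 2 minutes" -> add 2 instances
        if consecutive_breaches >= 2 and replicas < max_replicas:
            replicas = min(replicas + 2, max_replicas)
            set_replica_count(replicas)
            consecutive_breaches = 0
        time.sleep(check_interval_s)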
Predictive auto-scaling: Scale based on predicted future load.
[Historical Traffic] → [Prediction Model] → [Future Load Estimate]
│
▼
[Pre-scale 30 minutes before predicted spike]
Schedule-based scaling: Pre-defined scaling for known patterns.
9:00 AM weekdays: 10 instances (morning traffic spike)
6:00 PM weekdays: 5 instances (evening decline)
Weekends: 3 instances (lower traffic)
Scaling Metrics for ML:
| Metric | What It Indicates | Scaling Response |
|---|---|---|
| Request latency (P99) | Processing capacity | Scale up if above SLA |
| GPU utilization | Compute saturation | Scale up if sustained >80% |
| Request queue depth | Backlog accumulation | Scale up if growing |
| Memory utilization | Model/cache fit | May need larger instances |
| Error rate | System stress | Scale up, investigate errors |
Dynamic Batching:
Batching multiple requests into a single inference call dramatically improves GPU efficiency:
Without batching:
Request 1 → [GPU] → Response (20% GPU utilization)
Request 2 → [GPU] → Response (20% GPU utilization)
...
With batching:
Request 1 ─┐
Request 2 ─┼──→ [GPU: batch of 8] → Responses (80% GPU utilization)
... ─┘
Request 8 ─┘
Batching tradeoffs:
- Larger batches improve GPU utilization and throughput but add queuing latency while the batch fills.
- Latency-sensitive endpoints need a tight cap on how long a request may wait in the queue.
Batching parameters:
- max_batch_size: Maximum requests in a batch
- max_wait_time: Maximum time to wait for the batch to fill
Example configuration:
# Triton Inference Server batching config
dynamic_batching {
max_queue_delay_microseconds: 10000 # 10ms max wait
preferred_batch_size: [4, 8, 16] # Preferred batch sizes
}
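The same idea as an in-process micro-batcher sketch, built around the two parameters above; the queue-based interface and item shape are assumptions, not any specific framework's API:
import queue
import time

def batching_worker(model, request_queue, max_batch_size=8, max_wait_time_s=0.01):
    # Each queued item is a dict: {"features": [...], "result": queue.Queue()}
    while True:
        batch = [request_queue.get()]                  # block until one request arrives
        deadline = time.monotonic() + max_wait_time_s
        while len(batch) < max_batch_size:             # fill until full or wait budget spent
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        predictions = model.predict([r["features"] for r in batch])   # one batched call
        for req, pred in zip(batch, predictions):
            req["result"].put(pred)                    # return each result to its caller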
Scaling up introduces cold start latency—time to provision instances, load models, and warm caches. For large models (GB+), loading can take minutes. Strategies: keep minimum replica count, use model preloading, maintain warm standby instances, use smaller model variants for fast scaling.
Model inference has specific hardware requirements that differ from general web services. Understanding these requirements enables cost-effective infrastructure selection.
CPU vs. GPU vs. Specialized Accelerators:
| Hardware | Best For | Latency | Throughput | Cost |
|---|---|---|---|---|
| CPU (general) | Small models, tree-based, sparse operations | Low-Medium | Medium | Low |
| CPU (AVX-512) | Quantized models, optimized inference | Low | Medium-High | Low-Medium |
| GPU (NVIDIA T4) | Medium neural nets, mixed precision | Low | High | Medium |
| GPU (NVIDIA A10G) | Large neural nets, transformer models | Low | Very High | High |
| GPU (NVIDIA A100) | Largest models, maximum throughput | Very Low | Extreme | Very High |
| AWS Inferentia | Optimized inference, cost-effective | Low | High | Medium-Low |
| TPU (Google) | TensorFlow models, very large scale | Low | Extreme | High |
Hardware Selection Decision Tree:
Is the model a neural network?
├── No (tree-based, linear)
│ └── Use CPU (optimized libraries like XGBoost, ONNX Runtime)
│
└── Yes (neural network)
│
└── What's the model size?
├── Small (<100MB)
│ └── CPU may suffice; GPU if high throughput needed
│
├── Medium (100MB-1GB)
│ └── GPU recommended; consider T4 or Inferentia
│
└── Large (>1GB)
└── GPU required; A10G, A100, or multi-GPU
Memory Considerations:
Models must fit in memory (RAM for CPU, VRAM for GPU):
| Model Type | Typical Size | Memory Requirement |
|---|---|---|
| Logistic Regression | KB-MB | Minimal |
| Gradient Boosting | MB-GB | Moderate |
| Small Neural Net | 10-100MB | Moderate |
| BERT-base | ~500MB | High |
| Large Language Model | 10-100GB+ | Very High (multi-GPU) |
Memory optimization techniques:
- Quantization (e.g., FP32 → INT8) shrinks weights roughly 4x.
- Pruning and distillation produce smaller models with similar accuracy.
- Sharding across multiple GPUs when a model exceeds a single device's VRAM.
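A rough back-of-the-envelope check of the sizes above: weight memory is approximately parameter count times bytes per parameter (activations and framework overhead add more on top):
def model_memory_gb(num_parameters, bytes_per_param=4):   # 4 bytes = FP32
    return num_parameters * bytes_per_param / 1e9

print(model_memory_gb(110e6))       # BERT-base, ~110M params -> ~0.44 GB in FP32
print(model_memory_gb(110e6, 1))    # same model, INT8-quantized -> ~0.11 GB
print(model_memory_gb(70e9, 2))     # 70B-parameter LLM in FP16 -> ~140 GB (multi-GPU)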
Cost Optimization:
Spot/Preemptible instances: For batch inference, use spot instances at 60-90% discount. Requires handling interruptions gracefully.
Right-sizing: Match instance type to model needs. Overprovisioned GPUs waste money; underprovisioned CPUs add latency.
Inference-optimized instances: AWS Inf1 (Inferentia), GCP TPU inference, Azure ML instances are often more cost-effective than general-purpose GPUs.
Cost comparison example:
| Configuration | Hourly Cost | Throughput (req/sec) | Cost per 1M requests |
|---|---|---|---|
| 8x c5.large (CPU) | $0.68 | 100 | $1.89 |
| 2x g4dn.xlarge (T4 GPU) | $1.05 | 500 | $0.58 |
| 1x inf1.xlarge (Inferentia) | $0.23 | 300 | $0.21 |
Costs are illustrative; actual costs vary by model and workload.
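The cost-per-1M-requests column follows directly from hourly cost and sustained throughput; a quick sketch of the arithmetic:
def cost_per_million_requests(hourly_cost_usd, throughput_req_per_sec):
    requests_per_hour = throughput_req_per_sec * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000

print(cost_per_million_requests(0.68, 100))   # CPU fleet   -> ~$1.89
print(cost_per_million_requests(1.05, 500))   # T4 GPUs     -> ~$0.58
print(cost_per_million_requests(0.23, 300))   # Inferentia  -> ~$0.21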
Always profile model inference on candidate hardware before committing. Theoretical FLOPS don't account for memory bandwidth, data transfer overhead, or framework efficiency. A model that's 2x faster on paper might only be 1.2x faster in practice.
Deploying a new model version to production carries risk. A regression in model quality or a bug in serving code can have immediate business impact. Deployment strategies manage this risk through controlled rollout.
Deployment Strategy Comparison:
| Strategy | Risk Level | Rollback Speed | Best For |
|---|---|---|---|
| Shadow Deployment | None | N/A | Initial validation before any traffic |
| Canary Deployment | Low | Fast (seconds) | Standard model updates |
| Blue-Green Deployment | Medium | Fast (seconds) | Major version changes |
| Rolling Deployment | Medium | Slow (minutes) | Large fleets, stateless models |
| Feature Flag Deployment | Configurable | Instant | Rapid experimentation |
Shadow Deployment:
New model receives real traffic but predictions aren't used:
Request → [Current Model] → Production Response
│
└──→ [New Model] → Log only (compare offline)
Benefits:
- Zero user-facing risk; production responses still come from the current model.
- Validates the new model against real traffic and real feature values before it serves anyone.
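A sketch of request mirroring for shadow deployment, assuming both models expose the same predict interface; the fire-and-forget thread and logger are illustrative details:
import logging
import threading

logger = logging.getLogger("shadow")

def serve_with_shadow(production_model, shadow_model, features):
    prod_prediction = production_model.predict([features])[0]   # returned to the user

    def run_shadow():
        try:
            shadow_prediction = shadow_model.predict([features])[0]
            # Log only; shadow output never reaches users
            logger.info("shadow_compare prod=%s shadow=%s",
                        prod_prediction, shadow_prediction)
        except Exception:
            logger.exception("shadow model failed")

    threading.Thread(target=run_shadow, daemon=True).start()
    return prod_prediction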
Canary Deployment:
New model receives a small percentage of traffic:
Request → [Router: 95% / 5%]
│ │
▼ ▼
[Model v1.0] [Model v1.1]
(production) (canary)
Progression:
- Start at a small slice (e.g., 1-5%) and compare quality and latency metrics against the production model.
- Increase gradually (e.g., 5% → 25% → 50% → 100%), pausing at each step; roll back immediately if metrics regress.
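A deterministic traffic splitter sketch for the canary routing above, hashing on user ID so each user consistently hits the same version; the hash choice and model names are illustrative:
import hashlib

def route_model(user_id, canary_percent=5):
    # Stable hash -> the same user always lands in the same bucket
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return "model-v1.1-canary" if bucket < canary_percent else "model-v1.0"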
Blue-Green Deployment:
Maintain two complete environments, switch traffic atomically:
Before: Traffic → [Blue: v1.0] ← Active
[Green: v1.1] ← Standby (new version)
Switch: Traffic → [Blue: v1.0] ← Standby
[Green: v1.1] ← Active
Benefits:
- The new version can be fully validated in the green environment before receiving any traffic.
- Rollback is a single routing switch back to blue.
Costs:
- Two complete environments double infrastructure cost during the transition.
Deployment Validation Checklist:
Before promoting any model version:
Offline validation passed: evaluation metrics meet or exceed the current production model on held-out data.
Shadow/staging validation passed: prediction distributions, error rates, and latency look healthy against real traffic.
Canary validation passed: business and system metrics remain within thresholds at the canary traffic percentage.
Rollback plan ready: the previous version remains deployable, and the rollback trigger and owner are agreed upon.
Never deploy model changes before weekends, holidays, or when on-call coverage is reduced. Model issues can take time to surface, and you want full team availability to detect and respond. If you must deploy, use extra-conservative canary percentages and have a clear rollback trigger.
Production serving systems must be reliable—available when needed, returning correct results, and degrading gracefully when issues occur.
Availability Targets:
| Target | Downtime per Year | Use Case |
|---|---|---|
| 99% | 3.65 days | Development/staging |
| 99.9% | 8.76 hours | Standard production |
| 99.95% | 4.38 hours | Business-critical |
| 99.99% | 52.6 minutes | Revenue-critical |
| 99.999% | 5.26 minutes | Life/safety-critical |
Each additional "9" requires significantly more investment in redundancy, monitoring, and incident response.
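The downtime figures in the table are simple arithmetic on the availability target:
def downtime_minutes_per_year(availability_percent):
    return (1 - availability_percent / 100) * 365 * 24 * 60

print(downtime_minutes_per_year(99.9) / 60)   # ~8.76 hours
print(downtime_minutes_per_year(99.99))       # ~52.6 minutes
print(downtime_minutes_per_year(99.999))      # ~5.26 minutes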
Health Check Design:
Shallow vs. deep health checks:
# Shallow: Just checks if process is running
# (Flask-style handlers; assumes app = Flask(__name__), model, feature_store,
#  and a representative CANARY_INPUT are defined elsewhere)
@app.route('/health/live')
def liveness():
return {'status': 'alive'}
# Deep: Actually runs inference
@app.route('/health/ready')
def readiness():
try:
# Run a lightweight inference
test_prediction = model.predict(CANARY_INPUT)
# Validate prediction is sensible
assert 0 <= test_prediction <= 1
# Check dependencies
feature_store_healthy = feature_store.ping()
return {
'status': 'ready',
'model_version': model.version,
'feature_store': feature_store_healthy
}
except Exception as e:
return {'status': 'not_ready', 'error': str(e)}, 503
Use liveness checks so the orchestrator (e.g., Kubernetes) restarts a stuck process. Use readiness checks for load balancer routing, so unhealthy instances are excluded from rotation.
Failure Domain Isolation:
Isolate failures to prevent cascade:
┌─────────────────────────────────────────────────────────────────┐
│ Region: us-east-1 │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Zone A │ │ Zone B │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │Model Service │ │ │ │Model Service │ │ │
│ │ │Replicas: 3 │ │ │ │Replicas: 3 │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │Feature Store │ │ │ │Feature Store │ │ │
│ │ │(Read Replica) │ │ │ │(Read Replica) │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Zone failure affects only that zone; other zones continue serving.
Error budgets make reliability vs. velocity tradeoffs explicit. If your SLA is 99.9% availability, you have 0.1% error budget per month (~43 minutes). Spend it on deploys and experiments; when budget is exhausted, freeze changes and focus on stability. This aligns engineering incentives with reliability goals.
Running ML serving in production requires ongoing operational attention. Beyond the initial deployment, teams must handle incidents, perform maintenance, and continuously improve.
Essential Operational Capabilities:
| Category | Example Scenarios | Required Capabilities |
|---|---|---|
| Incident Response | Model latency spike, error rate increase | Alert routing, runbooks, rollback procedures |
| Capacity Management | Traffic growth, seasonal spikes | Capacity planning, auto-scaling, load testing |
| Change Management | Model updates, infrastructure changes | Deployment automation, approval workflows |
| Maintenance | Dependency updates, infrastructure patching | Maintenance windows, zero-downtime updates |
| Disaster Recovery | Region failure, data corruption | Backup/restore procedures, DR drills |
On-Call Best Practices:
Runbooks for common issues: Document steps for frequent alerts. On-call shouldn't require deep investigation at 3 AM.
Escalation paths: Clear escalation from on-call → team lead → domain expert → management.
Blameless postmortems: Focus on system improvements, not individual blame. What can we change to prevent recurrence?
Alert hygiene: Every alert should be actionable. If alerts are frequently noise, they get ignored.
Rotation and coverage: Fair on-call rotation with adequate rest between shifts. Cover for holidays and time zones.
Logging and Debugging:
Production debugging requires comprehensive logging:
import time
import logging

logger = logging.getLogger(__name__)

# Assumes model, feature_store, and generate_request_id() are defined elsewhere
def predict(request):
request_id = generate_request_id()
logger.info("Prediction request received", extra={
'request_id': request_id,
'user_id': request.user_id,
'model_version': model.version,
'feature_count': len(request.features),
})
start_time = time.time()
# Feature retrieval
features = feature_store.get_features(request)
feature_time = time.time() - start_time
# Model inference
prediction = model.predict(features)
inference_time = time.time() - start_time - feature_time
logger.info("Prediction completed", extra={
'request_id': request_id,
'prediction': prediction,
'feature_retrieval_ms': feature_time * 1000,
'inference_ms': inference_time * 1000,
'total_ms': (time.time() - start_time) * 1000,
})
return prediction
Log query examples:
- Trace a single request_id end to end across gateway, feature service, and model service.
- Find slow requests (total_ms above the SLA) and attribute latency to feature_retrieval_ms vs. inference_ms.
- Count predictions per model_version to confirm a rollout or rollback took effect.
Cost Management:
ML serving can be expensive. Ongoing cost optimization is essential:
Right-size instances: Monitor actual utilization; downsize over-provisioned resources.
Use spot/preemptible for batch: Batch prediction doesn't need on-demand reliability.
Optimize model efficiency: Quantization, pruning, and distillation reduce inference cost.
Reduce unnecessary predictions: Cache where possible; batch where possible.
Multi-tenancy: Share infrastructure across models and teams where feasible.
Cost visibility: Attribute serving spend to individual models, endpoints, and teams so optimization effort targets the biggest line items.
Example cost dashboard metrics:
- Cost per 1M predictions, broken down by model and endpoint
- GPU/CPU utilization versus provisioned capacity
- Month-over-month spend trend by team
Building and operating ML serving infrastructure is substantial ongoing effort. Managed services (AWS SageMaker, GCP Vertex AI, Azure ML) handle much of this complexity at premium cost. Evaluate build vs. buy based on: team size, expertise, scale requirements, customization needs, and long-term cost projections.
Serving infrastructure transforms trained models into production prediction services. The challenges span distributed systems, hardware optimization, deployment safety, and operational excellence. Success requires combining ML understanding with systems engineering expertise.
What's next:
Deploying models is not the end—it's the beginning of ongoing operation. The next page explores Monitoring and Maintenance—how to detect model degradation, handle drift, manage retraining, and ensure your ML system continues to perform over time.
You now understand how to design and operate serving infrastructure for production ML systems—from serving patterns and architecture through scaling, deployment, reliability, and operations. Next, we ensure these systems continue to perform through monitoring and maintenance.