An auto-scaling system is only as good as the signals it responds to. Choose the wrong metric, and your system will scale too late (causing outages), scale too early (wasting money), or oscillate erratically between states. Choose the right metric, and your system will feel like it's reading users' minds—scaling up moments before traffic spikes and gracefully contracting as demand wanes.
This page explores scaling triggers: the metrics and signals that auto-scaling policies monitor to make capacity decisions. We'll examine infrastructure metrics (CPU, memory), throughput metrics (request rate, queue depth), latency metrics, and custom application-specific signals. More importantly, we'll develop the judgment to choose the right trigger for your specific workload.
By the end of this page, you will understand the full taxonomy of scaling triggers, their strengths and weaknesses, how they behave under different workload patterns, and how to select optimal triggers for various system types. You'll be able to design metric-based scaling strategies that maintain performance while minimizing cost.
Scaling triggers fall into several categories, each with distinct characteristics. Understanding this taxonomy helps you navigate the options and choose appropriately.
| Category | Examples | Best For | Limitations |
|---|---|---|---|
| Infrastructure Metrics | CPU utilization, memory usage, network I/O, disk I/O | General-purpose workloads, compute-bound services | Lagging indicator; may not reflect user experience |
| Throughput Metrics | Requests per second, messages/second, transactions/second | API servers, event processors, web applications | Doesn't account for request complexity variation |
| Queue-Based Metrics | Queue depth, message backlog, pending tasks | Async workers, batch processors, event consumers | Can cause over-provisioning for bursty patterns |
| Latency Metrics | Response time p50/p95/p99, processing duration | User-facing services, SLA-bound systems | Requires careful threshold tuning; can oscillate |
| Custom Application Metrics | Active users, concurrent connections, business transactions | Application-specific patterns, complex dependencies | Requires instrumentation; harder to standardize |
| Composite/Derived Metrics | CPU × queue depth, weighted SLA score | Multi-dimensional workloads, sophisticated systems | Complex to configure and debug |
The most effective scaling triggers are leading indicators—metrics that rise before user experience degrades. CPU utilization is a lagging indicator (it's already high when users are suffering). Queue depth is a leading indicator (a growing queue predicts future latency increases). Whenever possible, choose leading indicators.
CPU utilization is the most common scaling trigger, chosen as the default in almost every cloud platform's auto-scaling configuration wizard. This prevalence has both good reasons and significant pitfalls.
How CPU-Based Scaling Works: The auto-scaler periodically samples CPU utilization across the group, aggregates it (typically as an average over a short window), compares it to a configured target, and adds or removes capacity to close the gap.
The Math Behind Target Tracking:
For CPU-based target tracking, the formula is approximately:
Desired Capacity = Current Capacity × (Current CPU / Target CPU)
Example: You have 10 instances at 90% CPU with a target of 60%:
Desired = 10 × (90 / 60) = 10 × 1.5 = 15 instances
The system would scale out to 15 instances, expecting the new capacity to distribute load and reduce average CPU to ~60%.
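The target-tracking calculation above can be sketched in Python (a simplified model; the function name and clamping bounds are illustrative, and real autoscalers also apply cooldowns and instance warm-up):

```python
import math

def desired_capacity(current: int, current_cpu: float, target_cpu: float,
                     min_cap: int = 1, max_cap: int = 100) -> int:
    """Target tracking: scale capacity in proportion to metric / target."""
    raw = current * (current_cpu / target_cpu)
    # Round up so per-instance load lands at or below the target,
    # then clamp to the group's configured bounds.
    return max(min_cap, min(max_cap, math.ceil(raw)))

print(desired_capacity(10, 90, 60))  # 15 instances, matching the example above
```

Rounding up rather than to the nearest integer means the policy errs on the side of extra capacity, which is the usual bias for scale-out.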
Many engineers default to a 70% CPU target without understanding why. The optimal target depends on your workload's variance. For steady traffic, 70-80% is reasonable. For bursty traffic, 40-50% provides headroom for spikes. For latency-sensitive services, even lower targets may be needed. There is no universal 'right' number—you must measure and tune.
Practical Considerations:
Aggregation Period: Using 1-minute average smooths noise but adds latency. Using 10-second samples is reactive but prone to false positives. Balance based on your traffic pattern.
Aggregation Function: Average is common but can hide problems. If you have 10 instances and one is at 100% while nine are at 50%, the average is 55%—no alarm triggers, but requests to the hot instance are suffering.
Instance Heterogeneity: If your scaling group has mixed instance types, CPU percentages aren't directly comparable. Normalize by compute capacity.
Reservation/Steal Effects: On shared infrastructure, CPU steal (time stolen by hypervisor) can make utilization appear lower than actual demand. Monitor for this.
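To see how averaging hides a hot instance (the aggregation pitfall described above), a quick sketch:

```python
def avg(values):
    """Mean CPU utilization across a group of instances."""
    return sum(values) / len(values)

# Nine instances at 50% CPU and one saturated instance at 100%.
cpu_per_instance = [50] * 9 + [100]

print(avg(cpu_per_instance))  # 55.0 -- below a typical 70% alarm threshold
print(max(cpu_per_instance))  # 100  -- the saturated instance is visible
```

Alarming on the max or a high percentile alongside the average catches hot instances that the mean conceals.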
Memory utilization is the second-most common infrastructure metric for scaling, but it behaves very differently from CPU and requires careful consideration.
Key Differences from CPU:
Memory is gradual, not immediate — Memory usage tends to grow slowly as connections accumulate, caches fill, or objects are allocated. Unlike CPU which can spike instantly, memory follows slower patterns.
Memory exhaustion is catastrophic — When CPU hits 100%, latency increases but the system functions. When memory hits 100%, the OOM killer terminates processes, often crashing your service.
Memory doesn't naturally release — CPU utilization drops immediately when load decreases. Memory often requires garbage collection, connection teardown, or explicit release—meaning it can stay high even after load decreases.
Interpretation is complex — High memory might mean:
- Legitimate load growth: more connections, a larger working set
- A memory leak, which scaling cannot fix
- Caches deliberately filling available memory
- Garbage collection that simply hasn't run yet
This ambiguity makes memory a tricky scaling trigger.
Practical Configuration:
Unlike CPU where 70-80% targets are common, memory targets should typically be lower:
| Workload Type | Recommended Memory Target | Rationale |
|---|---|---|
| General services | 60-70% | Leave headroom for traffic spikes and GC |
| JVM-based services | 70-80% | JVM manages heap; focus on heap metrics instead |
| Memory-intensive | 50-60% | Memory-bound services need more headroom |
| Containers (K8s) | 70-80% | Memory limits cause OOMKill; stay below |
Memory-Specific Considerations:
Distinguish between resident and virtual memory — RSS (resident set size) matters more than virtual memory
Account for kernel buffers and caches — Linux uses free memory for filesystem caches; this isn't a problem. Use 'available' memory, not 'free' memory.
Container memory accounting — In containers, memory limits are hard boundaries. Approaching the limit risks OOMKill.
JVM heap management — For JVM applications, monitor heap utilization rather than system memory. Consider scaling on GC pressure metrics.
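As a sketch of the "available vs. free" distinction (field names follow Linux's /proc/meminfo; the function itself is illustrative):

```python
def memory_pressure(mem_total_kb: int, mem_available_kb: int) -> float:
    """Fraction of memory effectively in use.

    Uses MemAvailable (free memory plus reclaimable caches) rather than
    MemFree, so filesystem caches are not mistaken for pressure.
    """
    return 1.0 - mem_available_kb / mem_total_kb

# 16 GiB total with 4 GiB available -> 75% pressure, even if MemFree is near zero
print(memory_pressure(16 * 1024 * 1024, 4 * 1024 * 1024))  # 0.75
```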
Scaling based on memory without addressing memory leaks is futile. If your application has a memory leak, adding instances temporarily decreases per-instance memory—until the new instances also leak. You'll keep scaling until you hit maximum capacity, then crash. Always investigate high memory before assuming it's load-related.
For asynchronous and event-driven systems, queue depth (also called backlog or pending messages) is often the most effective scaling trigger. It's a leading indicator that directly reflects the gap between incoming work and processing capacity.
Why Queue Depth Is Powerful:
Direct relationship to capacity need — If your queue is growing, you need more processing capacity by definition. There's no interpretation needed.
Leading indicator — A growing queue predicts future latency/delay. You can scale before users are affected.
Works when CPU doesn't — For I/O-bound workers that spend time waiting for external systems, CPU may be low even when overwhelmed. Queue depth reveals the true situation.
Natural load shedding — Queues provide built-in buffering, smoothing sudden spikes and giving scaling time to respond.
The Queue-Based Scaling Model:
Queue-based scaling follows a simple principle:
Desired Workers = Queue Depth / (Acceptable Messages Per Worker)
Example: You have 10,000 messages in queue, and each worker processes 1,000 messages before the SLA is violated:
Desired = 10,000 / 1,000 = 10 workers
This is implemented in AWS as "Target Tracking with SQS Queue Depth" and similar mechanisms on other platforms.
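The queue-based model above translates directly to code (names and clamping bounds are illustrative):

```python
import math

def desired_workers(queue_depth: int, msgs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 200) -> int:
    """One worker per acceptable slice of backlog, clamped to group bounds."""
    raw = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, raw))

print(desired_workers(10_000, 1_000))  # 10 workers, matching the example above
```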
Queue Depth Variations:
| Queue System | Depth Metric | Notes |
|---|---|---|
| AWS SQS | ApproximateNumberOfMessagesVisible | Polled every ~1 minute; use CloudWatch |
| Kafka | Consumer Lag (per partition) | Sum of (latest offset - consumer offset) per partition |
| RabbitMQ | messages_ready | Available in management API |
| Redis Streams | Consumer group lag | XLEN minus the group's last-delivered position; Redis 7+ exposes lag via XINFO GROUPS |
| Azure Service Bus | MessageCount | Per queue/subscription |
| GCP Pub/Sub | num_undelivered_messages | Available via Cloud Monitoring (formerly Stackdriver) |
Advanced: Age of Oldest Message
Beyond depth, consider age of oldest message—how long the oldest item has been waiting. This metric directly reflects latency SLA compliance. If your SLA is "process within 5 minutes" and the oldest message is 4 minutes old, scaling is urgent even if depth is moderate.
If your queue receives sudden bursts (e.g., batch job drops 100,000 messages at once), queue-depth scaling can dramatically over-provision. By the time 50 new workers launch, the burst might be processed by existing workers. Use scaling rate limits and cool-down periods to prevent this.
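One way to apply the scaling rate limits and cooldowns suggested above (a sketch; parameter names and limits are illustrative):

```python
def step_limited_scale(current: int, desired: int, max_step: int,
                       cooldown_remaining_s: int) -> int:
    """Cap each scale-out jump and honor a cooldown between actions."""
    if cooldown_remaining_s > 0:
        return current  # still cooling down from the last scaling action
    if desired > current:
        return min(desired, current + max_step)  # cap the jump per decision
    return desired  # scale-in passes through (often gated by its own cooldown)

# A 100k-message burst asks for 100 workers; the limiter adds at most 10 per step.
print(step_limited_scale(current=10, desired=100, max_step=10,
                         cooldown_remaining_s=0))  # 20
```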
For synchronous services (REST APIs, gRPC services, web applications), request rate is often a better scaling trigger than CPU because it directly measures the work arriving at your system.
Request-Based Scaling (Target Tracking):
The principle is simple: define how many requests per second each instance should handle, and scale to maintain that target.
Desired Capacity = Current RPS / Target RPS Per Instance
Example: You're receiving 5,000 requests/second, and each instance comfortably handles 500 req/s:
Desired = 5,000 / 500 = 10 instances
Why Request Rate Works Well:
- It's a leading indicator: requests arrive before CPU rises or latency degrades
- It measures demand directly, independent of how efficiently instances handle it
- Targets are easy to validate with load testing ("each instance handles N req/s")
Implementing Request-Based Scaling:
AWS Application Load Balancer + Target Tracking:
```
Metric: ALBRequestCountPerTarget
Target: 1000 (requests/instance/minute)
Scale-out cooldown: 60 seconds
Scale-in cooldown: 300 seconds
```
Kubernetes HPA + Custom Metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```
Determining Target RPS:
The challenging part is determining the right requests-per-instance target. Common approaches:
- Load test a single instance and set the target at 60-70% of its breaking point
- Derive it from production history: peak RPS divided by the instance count that handled it comfortably
- Start conservative, then raise the target gradually while watching latency
If your requests vary dramatically in cost (some take 10ms, others take 10 seconds), pure request rate scaling will be inaccurate. Consider weighting by request type, using latency as a secondary signal, or separating endpoints into different scaling groups with distinct targets.
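One simple way to weight by request type, as suggested above (the endpoints and cost weights are invented for illustration):

```python
# Illustrative relative costs: a report is ~500x the work of a health check.
COST_WEIGHTS = {"/health": 0.1, "/search": 1.0, "/report": 50.0}

def effective_rps(rps_per_endpoint: dict) -> float:
    """Weight raw request rates by relative cost into one comparable load signal."""
    return sum(COST_WEIGHTS[ep] * rps for ep, rps in rps_per_endpoint.items())

# 1000 cheap health checks contribute less load than 10 expensive reports.
print(round(effective_rps({"/health": 1000, "/search": 200, "/report": 10}), 1))  # 800.0
```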
Latency metrics (response time, processing duration) are the most user-relevant signals—they directly measure what users experience. Scaling on latency is powerful but requires careful configuration.
Why Latency Is the Ultimate Metric:
- It measures exactly what users experience; every other metric is a proxy for it
- It maps directly to SLAs and SLOs
- It captures degradation from any cause: CPU saturation, I/O contention, slow downstream dependencies
The Challenge:
Latency-based scaling is tricky because:
- Latency is a lagging indicator: by the time it rises, users are already affected
- It's noisy, and high percentiles need enough samples to be stable
- High latency may come from a downstream dependency that adding instances won't fix
- Poorly tuned thresholds interact with scaling actions to create oscillation
Practical Implementation:
Latency-based scaling typically uses step scaling or target tracking with latency percentiles:
Step Scaling Example:
- p99 latency 200-300ms: add 10% capacity
- p99 latency 300-500ms: add 25% capacity
- p99 latency > 500ms: add 50% capacity
- p99 latency < 150ms for 10 min: remove 10% capacity
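The step bands above, expressed as a function (a sketch; scale-in is omitted because it typically runs on a separate, slower evaluation):

```python
import math

def step_scale_out(current: int, p99_ms: float) -> int:
    """Map p99 latency bands to percentage capacity additions (bands from above)."""
    if p99_ms > 500:
        pct = 50
    elif p99_ms > 300:
        pct = 25
    elif p99_ms > 200:
        pct = 10
    else:
        return current  # below all bands: no scale-out
    return current + math.ceil(current * pct / 100)

print(step_scale_out(20, 350))  # 25 -- the 25% band adds 5 instances
```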
Latency Metric Sources:
| Source | Metric | Notes |
|---|---|---|
| Load Balancer | TargetResponseTime | Excludes network to user; measures service time |
| APM (DataDog, New Relic) | Transaction duration | Full distributed trace; excludes only client network |
| Application metrics | Custom timers | Most accurate but requires instrumentation |
| Prometheus | histogram_quantile | Built-in percentile aggregation |
Latency-based scaling can oscillate: high latency → scale out → latency drops → scale in → high latency again. Prevent this with asymmetric cooldowns (long scale-in, short scale-out), hysteresis (different thresholds for up vs down), and stabilization windows (K8s HPA v2 feature).
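Hysteresis can be sketched as a pair of thresholds with a dead band between them (the threshold values here are illustrative):

```python
def hysteresis_decision(current: int, p99_ms: float,
                        scale_out_above_ms: float = 300.0,
                        scale_in_below_ms: float = 150.0) -> int:
    """Separate up/down thresholds leave a dead band that damps oscillation."""
    if p99_ms > scale_out_above_ms:
        return current + 1
    if p99_ms < scale_in_below_ms:
        return max(1, current - 1)
    return current  # inside the dead band: hold steady

print(hysteresis_decision(10, 200.0))  # 10 -- between thresholds, no action
```

With a single threshold, latency hovering near it would flip decisions every evaluation; the gap between 150 ms and 300 ms absorbs that noise.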
Sometimes infrastructure metrics don't capture what matters. Custom application metrics let you scale based on business-relevant signals that only your application understands.
Implementing Custom Metrics Scaling:
Step 1: Expose the Metric
Your application must publish the metric to a monitoring system:
```python
# Python with Prometheus client
from prometheus_client import Gauge

active_connections = Gauge(
    'websocket_active_connections',
    'Number of active WebSocket connections',
)

def on_connect(ws):
    active_connections.inc()

def on_disconnect(ws):
    active_connections.dec()
```
Step 2: Configure External Metrics in Auto-Scaler
For Kubernetes HPA:
```yaml
metrics:
  - type: External
    external:
      metric:
        name: websocket_active_connections
        selector:
          matchLabels:
            service: chat-server
      target:
        type: Value
        value: "1000"  # connections per pod
```
For AWS: publish the metric to CloudWatch (for example, via the PutMetricData API) and reference it as a customized metric specification in a target tracking or step scaling policy.
Step 3: Tune Based on Behavior
Custom metrics require experimentation. The relationship between metric value and required capacity isn't always linear. Monitor, adjust targets, and iterate.
Custom metrics require instrumentation—code changes to expose metrics, monitoring infrastructure to collect them, and integration with auto-scaling systems. This investment pays off for systems where infrastructure metrics are poor proxies for load, but don't over-engineer. Start with standard metrics; add custom ones when they demonstrably improve scaling behavior.
With all these options, how do you choose? The answer depends on your workload's characteristics. Here's a decision framework:
| Workload Type | Primary Trigger | Secondary Trigger | Why |
|---|---|---|---|
| Compute-bound API servers | CPU utilization (50-70%) | Request rate | CPU directly measures compute demand |
| I/O-bound API servers | Request rate or latency | Active connections | CPU is low even when overloaded |
| Async message consumers | Queue depth | Consumer lag age | Queue size directly measures pending work |
| Batch processing workers | Queue depth / job count | CPU (to avoid waste) | Job count determines capacity need |
| WebSocket/real-time servers | Active connections | Memory | Connections consume memory, not CPU |
| ML inference services | GPU utilization | Request queue depth | GPU is the scarce resource |
| Database read replicas | Replication lag OR read IOPS | CPU | Replica lag hurts consistency; IOPS indicates read load |
| Cache layer | Memory utilization + hit rate | Request rate | Cache effectiveness determines value |
The Multi-Signal Approach:
Mature systems often use multiple scaling triggers with the following logic:
```
If (CPU > 70%) OR (RequestRate/Instance > 500) OR (QueueDepth > 10000):
    Scale Out
If (CPU < 30%) AND (RequestRate/Instance < 200) AND (QueueDepth < 1000):
    Scale In (after cooldown)
```
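The multi-signal rules above, as a small decision function (thresholds taken from the pseudocode):

```python
def multi_signal(cpu: float, rps_per_instance: float, queue_depth: int) -> str:
    """OR to scale out (any hot signal), AND to scale in (all signals quiet)."""
    if cpu > 70 or rps_per_instance > 500 or queue_depth > 10_000:
        return "scale_out"
    if cpu < 30 and rps_per_instance < 200 and queue_depth < 1_000:
        return "scale_in"
    return "hold"

# One hot signal (request rate) is enough to trigger scale-out.
print(multi_signal(cpu=40, rps_per_instance=600, queue_depth=500))  # scale_out
```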
This approach:
- Scales out aggressively: any single hot signal triggers new capacity
- Scales in conservatively: all signals must be quiet simultaneously
- Builds asymmetry into the policy itself, which damps oscillation
The Composite Metric Pattern:
Advanced operators create derived metrics that combine multiple signals:
```
load_score = 0.4 * normalize(cpu)
           + 0.3 * normalize(request_rate)
           + 0.3 * normalize(queue_depth)

Scale on: target load_score = 0.6
```
This weighted approach captures multi-dimensional load in a single metric, simplifying policy configuration.
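A sketch of the composite score (the normalization maxima are assumptions; choose values that represent "fully loaded" for your system):

```python
def normalize(value: float, max_value: float) -> float:
    """Clamp a raw signal into [0, 1] relative to its expected maximum."""
    return min(1.0, max(0.0, value / max_value))

def load_score(cpu_pct: float, rps: float, queue_depth: int) -> float:
    """Weighted composite of three normalized signals (weights from above)."""
    return (0.4 * normalize(cpu_pct, 100)
            + 0.3 * normalize(rps, 1_000)
            + 0.3 * normalize(queue_depth, 10_000))

print(round(load_score(50, 500, 2000), 2))  # 0.41 -- below a 0.6 target
```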
Don't over-engineer initial scaling configuration. Start with CPU (most common) or the single most relevant metric. Observe behavior under real traffic. Add complexity only when simple approaches demonstrably fail. Most production systems do fine with one well-tuned trigger.
We've explored the full landscape of scaling triggers. Let's consolidate the key insights:
- Prefer leading indicators (queue depth, request rate) over lagging ones (CPU, latency) where possible
- Match the trigger to the workload: CPU for compute-bound services, queue depth for async workers, connections for real-time systems
- There are no universal thresholds; the right target depends on your traffic's variance, so measure and tune
- Start with one well-tuned trigger and add signals only when the simple approach demonstrably fails
What's Next:
Now that we know what to measure, we need to define how to respond. The next page explores scaling policies—the rules that translate metric observations into scaling actions. We'll cover target tracking, step scaling, simple scaling, and scheduled scaling.
You now understand the full taxonomy of scaling triggers, from infrastructure metrics like CPU and memory to throughput-based, queue-based, latency-based, and custom application metrics. You can analyze a workload and select appropriate triggers that will drive effective, cost-efficient scaling behavior.