An auto-scaling system is only as good as the signals it responds to. Choose the wrong metric, and your system will scale too late (causing outages), scale too early (wasting money), or oscillate erratically between states. Choose the right metric, and your system will feel like it's reading users' minds—scaling up moments before traffic spikes and gracefully contracting as demand wanes.
This page explores scaling triggers: the metrics and signals that auto-scaling policies monitor to make capacity decisions. We'll examine infrastructure metrics (CPU, memory), throughput metrics (request rate, queue depth), latency metrics, and custom application-specific signals. More importantly, we'll develop the judgment to choose the right trigger for your specific workload.
By the end of this page, you will understand the full taxonomy of scaling triggers, their strengths and weaknesses, how they behave under different workload patterns, and how to select optimal triggers for various system types. You'll be able to design metric-based scaling strategies that maintain performance while minimizing cost.
Scaling triggers fall into several categories, each with distinct characteristics. Understanding this taxonomy helps you navigate the options and choose appropriately.
| Category | Examples | Best For | Limitations |
|---|---|---|---|
| Infrastructure Metrics | CPU utilization, memory usage, network I/O, disk I/O | General-purpose workloads, compute-bound services | Lagging indicator; may not reflect user experience |
| Throughput Metrics | Requests per second, messages/second, transactions/second | API servers, event processors, web applications | Doesn't account for request complexity variation |
| Queue-Based Metrics | Queue depth, message backlog, pending tasks | Async workers, batch processors, event consumers | Can cause over-provisioning for bursty patterns |
| Latency Metrics | Response time p50/p95/p99, processing duration | User-facing services, SLA-bound systems | Requires careful threshold tuning; can oscillate |
| Custom Application Metrics | Active users, concurrent connections, business transactions | Application-specific patterns, complex dependencies | Requires instrumentation; harder to standardize |
| Composite/Derived Metrics | CPU × queue depth, weighted SLA score | Multi-dimensional workloads, sophisticated systems | Complex to configure and debug |
The most effective scaling triggers are leading indicators—metrics that rise before user experience degrades. CPU utilization is a lagging indicator (it's already high when users are suffering). Queue depth is a leading indicator (a growing queue predicts future latency increases). Whenever possible, choose leading indicators.
CPU utilization is the most common scaling trigger, chosen as the default in almost every cloud platform's auto-scaling configuration wizard. This prevalence has both good reasons and significant pitfalls.
How CPU-Based Scaling Works: The auto-scaler periodically samples CPU utilization across the group, aggregates it (typically as an average over a short window), compares it to a configured target, and adds or removes capacity to close the gap.
The Math Behind Target Tracking:
For CPU-based target tracking, the formula is approximately:
Desired Capacity = Current Capacity × (Current CPU / Target CPU)
Example: You have 10 instances at 90% CPU with a target of 60%:
Desired = 10 × (90 / 60) = 10 × 1.5 = 15 instances
The system would scale out to 15 instances, expecting the new capacity to distribute load and reduce average CPU to ~60%.
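The target-tracking calculation above can be sketched in Python (a simplified model; the function name and clamping bounds are illustrative, and real autoscalers also apply cooldowns and instance warm-up):

```python
import math

def desired_capacity(current: int, current_cpu: float, target_cpu: float,
                     min_cap: int = 1, max_cap: int = 100) -> int:
    """Target tracking: scale capacity in proportion to metric / target."""
    raw = current * (current_cpu / target_cpu)
    # Round up so per-instance load lands at or below the target,
    # then clamp to the group's configured bounds.
    return max(min_cap, min(max_cap, math.ceil(raw)))

print(desired_capacity(10, 90, 60))  # 15 instances, matching the example above
```

Rounding up rather than to the nearest integer means the policy errs on the side of extra capacity, which is the usual bias for scale-out.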
Many engineers default to a 70% CPU target without understanding why. The optimal target depends on your workload's variance. For steady traffic, 70-80% is reasonable. For bursty traffic, 40-50% provides headroom for spikes. For latency-sensitive services, even lower targets may be needed. There is no universal 'right' number—you must measure and tune.
Practical Considerations:
Aggregation Period: Using 1-minute average smooths noise but adds latency. Using 10-second samples is reactive but prone to false positives. Balance based on your traffic pattern.
Aggregation Function: Average is common but can hide problems. If you have 10 instances and one is at 100% while nine are at 50%, the average is 55%—no alarm triggers, but requests to the hot instance are suffering.
Instance Heterogeneity: If your scaling group has mixed instance types, CPU percentages aren't directly comparable. Normalize by compute capacity.
Reservation/Steal Effects: On shared infrastructure, CPU steal (time stolen by hypervisor) can make utilization appear lower than actual demand. Monitor for this.
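To see how averaging hides a hot instance (the aggregation pitfall described above), a quick sketch:

```python
def avg(values):
    """Mean CPU utilization across a group of instances."""
    return sum(values) / len(values)

# Nine instances at 50% CPU and one saturated instance at 100%.
cpu_per_instance = [50] * 9 + [100]

print(avg(cpu_per_instance))  # 55.0 -- below a typical 70% alarm threshold
print(max(cpu_per_instance))  # 100  -- the saturated instance is visible
```

Alarming on the max or a high percentile alongside the average catches hot instances that the mean conceals.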
Memory utilization is the second-most common infrastructure metric for scaling, but it behaves very differently from CPU and requires careful consideration.
Key Differences from CPU:
Memory is gradual, not immediate — Memory usage tends to grow slowly as connections accumulate, caches fill, or objects are allocated. Unlike CPU which can spike instantly, memory follows slower patterns.
Memory exhaustion is catastrophic — When CPU hits 100%, latency increases but the system functions. When memory hits 100%, the OOM killer terminates processes, often crashing your service.
Memory doesn't naturally release — CPU utilization drops immediately when load decreases. Memory often requires garbage collection, connection teardown, or explicit release—meaning it can stay high even after load decreases.
Interpretation is complex — High memory might mean:
- Legitimate load growth: more connections, a larger working set
- A memory leak, which scaling cannot fix
- Caches deliberately filling available memory
- Garbage collection that simply hasn't run yet
This ambiguity makes memory a tricky scaling trigger.
Practical Configuration:
Unlike CPU where 70-80% targets are common, memory targets should typically be lower:
| Workload Type | Recommended Memory Target | Rationale |
|---|---|---|
| General services | 60-70% | Leave headroom for traffic spikes and GC |
| JVM-based services | 70-80% | JVM manages heap; focus on heap metrics instead |
| Memory-intensive | 50-60% | Memory-bound services need more headroom |
| Containers (K8s) | 70-80% | Memory limits cause OOMKill; stay below |
Memory-Specific Considerations:
Distinguish between resident and virtual memory — RSS (resident set size) matters more than virtual memory
Account for kernel buffers and caches — Linux uses free memory for filesystem caches; this isn't a problem. Use 'available' memory, not 'free' memory.
Container memory accounting — In containers, memory limits are hard boundaries. Approaching the limit risks OOMKill.
JVM heap management — For JVM applications, monitor heap utilization rather than system memory. Consider scaling on GC pressure metrics.
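As a sketch of the "available vs. free" distinction (field names follow Linux's /proc/meminfo; the function itself is illustrative):

```python
def memory_pressure(mem_total_kb: int, mem_available_kb: int) -> float:
    """Fraction of memory effectively in use.

    Uses MemAvailable (free memory plus reclaimable caches) rather than
    MemFree, so filesystem caches are not mistaken for pressure.
    """
    return 1.0 - mem_available_kb / mem_total_kb

# 16 GiB total with 4 GiB available -> 75% pressure, even if MemFree is near zero
print(memory_pressure(16 * 1024 * 1024, 4 * 1024 * 1024))  # 0.75
```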
Scaling based on memory without addressing memory leaks is futile. If your application has a memory leak, adding instances temporarily decreases per-instance memory—until the new instances also leak. You'll keep scaling until you hit maximum capacity, then crash. Always investigate high memory before assuming it's load-related.
For asynchronous and event-driven systems, queue depth (also called backlog or pending messages) is often the most effective scaling trigger. It's a leading indicator that directly reflects the gap between incoming work and processing capacity.
Why Queue Depth Is Powerful:
Direct relationship to capacity need — If your queue is growing, you need more processing capacity by definition. There's no interpretation needed.
Leading indicator — A growing queue predicts future latency/delay. You can scale before users are affected.
Works when CPU doesn't — For I/O-bound workers that spend time waiting for external systems, CPU may be low even when overwhelmed. Queue depth reveals the true situation.
Natural load shedding — Queues provide built-in buffering, smoothing sudden spikes and giving scaling time to respond.
The Queue-Based Scaling Model:
Queue-based scaling follows a simple principle:
Desired Workers = Queue Depth / (Acceptable Messages Per Worker)
Example: You have 10,000 messages in queue, and each worker processes 1,000 messages before the SLA is violated:
Desired = 10,000 / 1,000 = 10 workers
This is implemented in AWS as "Target Tracking with SQS Queue Depth" and similar mechanisms on other platforms.
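The queue-based model above translates directly to code (names and clamping bounds are illustrative):

```python
import math

def desired_workers(queue_depth: int, msgs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 200) -> int:
    """One worker per acceptable slice of backlog, clamped to group bounds."""
    raw = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, raw))

print(desired_workers(10_000, 1_000))  # 10 workers, matching the example above
```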
Queue Depth Variations:
| Queue System | Depth Metric | Notes |
|---|---|---|
| AWS SQS | ApproximateNumberOfMessagesVisible | Polled every ~1 minute; use CloudWatch |
| Kafka | Consumer Lag (per partition) | Sum of (latest offset - consumer offset) per partition |
| RabbitMQ | messages_ready | Available in management API |
| Redis Streams | Consumer group lag | XLEN minus the group's last-delivered position; Redis 7+ exposes lag via XINFO GROUPS |
| Azure Service Bus | MessageCount | Per queue/subscription |
| GCP Pub/Sub | num_undelivered_messages | Available via Cloud Monitoring (formerly Stackdriver) |
Advanced: Age of Oldest Message
Beyond depth, consider age of oldest message—how long the oldest item has been waiting. This metric directly reflects latency SLA compliance. If your SLA is "process within 5 minutes" and the oldest message is 4 minutes old, scaling is urgent even if depth is moderate.
If your queue receives sudden bursts (e.g., batch job drops 100,000 messages at once), queue-depth scaling can dramatically over-provision. By the time 50 new workers launch, the burst might be processed by existing workers. Use scaling rate limits and cool-down periods to prevent this.
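One way to apply the scaling rate limits and cooldowns suggested above (a sketch; parameter names and limits are illustrative):

```python
def step_limited_scale(current: int, desired: int, max_step: int,
                       cooldown_remaining_s: int) -> int:
    """Cap each scale-out jump and honor a cooldown between actions."""
    if cooldown_remaining_s > 0:
        return current  # still cooling down from the last scaling action
    if desired > current:
        return min(desired, current + max_step)  # cap the jump per decision
    return desired  # scale-in passes through (often gated by its own cooldown)

# A 100k-message burst asks for 100 workers; the limiter adds at most 10 per step.
print(step_limited_scale(current=10, desired=100, max_step=10,
                         cooldown_remaining_s=0))  # 20
```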
For synchronous services (REST APIs, gRPC services, web applications), request rate is often a better scaling trigger than CPU because it directly measures the work arriving at your system.
Request-Based Scaling (Target Tracking):
The principle is simple: define how many requests per second each instance should handle, and scale to maintain that target.
Desired Capacity = Current RPS / Target RPS Per Instance
Example: You're receiving 5,000 requests/second, and each instance comfortably handles 500 req/s:
Desired = 5,000 / 500 = 10 instances
Why Request Rate Works Well:
- It's a leading indicator: requests arrive before CPU rises or latency degrades
- It measures demand directly, independent of how efficiently instances handle it
- Targets are easy to validate with load testing ("each instance handles N req/s")
Implementing Request-Based Scaling:
AWS Application Load Balancer + Target Tracking:
```
Metric: ALBRequestCountPerTarget
Target: 1000 (requests/instance/minute)
Scale-out cooldown: 60 seconds
Scale-in cooldown: 300 seconds
```
Kubernetes HPA + Custom Metrics:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```
Determining Target RPS:
The challenging part is determining the right requests-per-instance target. Common approaches:
- Load test a single instance and set the target at 60-70% of its breaking point
- Derive it from production history: peak RPS divided by the instance count that handled it comfortably
- Start conservative, then raise the target gradually while watching latency
If your requests vary dramatically in cost (some take 10ms, others take 10 seconds), pure request rate scaling will be inaccurate. Consider weighting by request type, using latency as a secondary signal, or separating endpoints into different scaling groups with distinct targets.
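One simple way to weight by request type, as suggested above (the endpoints and cost weights are invented for illustration):

```python
# Illustrative relative costs: a report is ~500x the work of a health check.
COST_WEIGHTS = {"/health": 0.1, "/search": 1.0, "/report": 50.0}

def effective_rps(rps_per_endpoint: dict) -> float:
    """Weight raw request rates by relative cost into one comparable load signal."""
    return sum(COST_WEIGHTS[ep] * rps for ep, rps in rps_per_endpoint.items())

# 1000 cheap health checks contribute less load than 10 expensive reports.
print(round(effective_rps({"/health": 1000, "/search": 200, "/report": 10}), 1))  # 800.0
```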
Latency metrics (response time, processing duration) are the most user-relevant signals—they directly measure what users experience. Scaling on latency is powerful but requires careful configuration.
Why Latency Is the Ultimate Metric:
- It measures exactly what users experience; every other metric is a proxy for it
- It maps directly to SLAs and SLOs
- It captures degradation from any cause: CPU saturation, I/O contention, slow downstream dependencies
The Challenge:
Latency-based scaling is tricky because:
- Latency is a lagging indicator: by the time it rises, users are already affected
- It's noisy, and high percentiles need enough samples to be stable
- High latency may come from a downstream dependency that adding instances won't fix
- Poorly tuned thresholds interact with scaling actions to create oscillation
Practical Implementation:
Latency-based scaling typically uses step scaling or target tracking with latency percentiles:
Step Scaling Example:
- p99 latency 200-300ms: add 10% capacity
- p99 latency 300-500ms: add 25% capacity
- p99 latency > 500ms: add 50% capacity
- p99 latency < 150ms for 10 min: remove 10% capacity
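The step bands above, expressed as a function (a sketch; scale-in is omitted because it typically runs on a separate, slower evaluation):

```python
import math

def step_scale_out(current: int, p99_ms: float) -> int:
    """Map p99 latency bands to percentage capacity additions (bands from above)."""
    if p99_ms > 500:
        pct = 50
    elif p99_ms > 300:
        pct = 25
    elif p99_ms > 200:
        pct = 10
    else:
        return current  # below all bands: no scale-out
    return current + math.ceil(current * pct / 100)

print(step_scale_out(20, 350))  # 25 -- the 25% band adds 5 instances
```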
Latency Metric Sources:
| Source | Metric | Notes |
|---|---|---|
| Load Balancer | TargetResponseTime | Excludes network to user; measures service time |
| APM (DataDog, New Relic) | Transaction duration | Full distributed trace; excludes only client network |
| Application metrics | Custom timers | Most accurate but requires instrumentation |
| Prometheus | histogram_quantile | Built-in percentile aggregation |
Latency-based scaling can oscillate: high latency → scale out → latency drops → scale in → high latency again. Prevent this with asymmetric cooldowns (long scale-in, short scale-out), hysteresis (different thresholds for up vs down), and stabilization windows (K8s HPA v2 feature).
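Hysteresis can be sketched as a pair of thresholds with a dead band between them (the threshold values here are illustrative):

```python
def hysteresis_decision(current: int, p99_ms: float,
                        scale_out_above_ms: float = 300.0,
                        scale_in_below_ms: float = 150.0) -> int:
    """Separate up/down thresholds leave a dead band that damps oscillation."""
    if p99_ms > scale_out_above_ms:
        return current + 1
    if p99_ms < scale_in_below_ms:
        return max(1, current - 1)
    return current  # inside the dead band: hold steady

print(hysteresis_decision(10, 200.0))  # 10 -- between thresholds, no action
```

With a single threshold, latency hovering near it would flip decisions every evaluation; the gap between 150 ms and 300 ms absorbs that noise.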
Sometimes infrastructure metrics don't capture what matters. Custom application metrics let you scale based on business-relevant signals that only your application understands.
Implementing Custom Metrics Scaling:
Step 1: Expose the Metric
Your application must publish the metric to a monitoring system:
```python
# Python with Prometheus client
from prometheus_client import Gauge

active_connections = Gauge(
    'websocket_active_connections',
    'Number of active WebSocket connections',
)

def on_connect(ws):
    active_connections.inc()

def on_disconnect(ws):
    active_connections.dec()
```
Step 2: Configure External Metrics in Auto-Scaler
For Kubernetes HPA:
```yaml
metrics:
  - type: External
    external:
      metric:
        name: websocket_active_connections
        selector:
          matchLabels:
            service: chat-server
      target:
        type: Value
        value: "1000"  # connections per pod
```
For AWS: publish the metric to CloudWatch (for example, via the PutMetricData API) and reference it as a customized metric specification in a target tracking or step scaling policy.
Step 3: Tune Based on Behavior
Custom metrics require experimentation. The relationship between metric value and required capacity isn't always linear. Monitor, adjust targets, and iterate.
Custom metrics require instrumentation—code changes to expose metrics, monitoring infrastructure to collect them, and integration with auto-scaling systems. This investment pays off for systems where infrastructure metrics are poor proxies for load, but don't over-engineer. Start with standard metrics; add custom ones when they demonstrably improve scaling behavior.
With all these options, how do you choose? The answer depends on your workload's characteristics. Here's a decision framework:
| Workload Type | Primary Trigger | Secondary Trigger | Why |
|---|---|---|---|
| Compute-bound API servers | CPU utilization (50-70%) | Request rate | CPU directly measures compute demand |
| I/O-bound API servers | Request rate or latency | Active connections | CPU is low even when overloaded |
| Async message consumers | Queue depth | Consumer lag age | Queue size directly measures pending work |
| Batch processing workers | Queue depth / job count | CPU (to avoid waste) | Job count determines capacity need |
| WebSocket/real-time servers | Active connections | Memory | Connections consume memory, not CPU |
| ML inference services | GPU utilization | Request queue depth | GPU is the scarce resource |
| Database read replicas | Replication lag OR read IOPS | CPU | Replica lag hurts consistency; IOPS indicates read load |
| Cache layer | Memory utilization + hit rate | Request rate | Cache effectiveness determines value |
The Multi-Signal Approach:
Mature systems often use multiple scaling triggers with the following logic:
```
If (CPU > 70%) OR (RequestRate/Instance > 500) OR (QueueDepth > 10000):
    Scale Out
If (CPU < 30%) AND (RequestRate/Instance < 200) AND (QueueDepth < 1000):
    Scale In (after cooldown)
```
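The multi-signal rules above, as a small decision function (thresholds taken from the pseudocode):

```python
def multi_signal(cpu: float, rps_per_instance: float, queue_depth: int) -> str:
    """OR to scale out (any hot signal), AND to scale in (all signals quiet)."""
    if cpu > 70 or rps_per_instance > 500 or queue_depth > 10_000:
        return "scale_out"
    if cpu < 30 and rps_per_instance < 200 and queue_depth < 1_000:
        return "scale_in"
    return "hold"

# One hot signal (request rate) is enough to trigger scale-out.
print(multi_signal(cpu=40, rps_per_instance=600, queue_depth=500))  # scale_out
```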
This approach:
- Scales out aggressively: any single hot signal triggers new capacity
- Scales in conservatively: all signals must be quiet simultaneously
- Builds asymmetry into the policy itself, which damps oscillation
The Composite Metric Pattern:
Advanced operators create derived metrics that combine multiple signals:
```
load_score = 0.4 * normalize(cpu)
           + 0.3 * normalize(request_rate)
           + 0.3 * normalize(queue_depth)

Scale on: target load_score = 0.6
```
This weighted approach captures multi-dimensional load in a single metric, simplifying policy configuration.
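A sketch of the composite score (the normalization maxima are assumptions; choose values that represent "fully loaded" for your system):

```python
def normalize(value: float, max_value: float) -> float:
    """Clamp a raw signal into [0, 1] relative to its expected maximum."""
    return min(1.0, max(0.0, value / max_value))

def load_score(cpu_pct: float, rps: float, queue_depth: int) -> float:
    """Weighted composite of three normalized signals (weights from above)."""
    return (0.4 * normalize(cpu_pct, 100)
            + 0.3 * normalize(rps, 1_000)
            + 0.3 * normalize(queue_depth, 10_000))

print(round(load_score(50, 500, 2000), 2))  # 0.41 -- below a 0.6 target
```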
Don't over-engineer initial scaling configuration. Start with CPU (most common) or the single most relevant metric. Observe behavior under real traffic. Add complexity only when simple approaches demonstrably fail. Most production systems do fine with one well-tuned trigger.
We've explored the full landscape of scaling triggers. Let's consolidate the key insights:
- Prefer leading indicators (queue depth, request rate) over lagging ones (CPU, latency) where possible
- Match the trigger to the workload: CPU for compute-bound services, queue depth for async workers, connections for real-time systems
- There are no universal thresholds; the right target depends on your traffic's variance, so measure and tune
- Start with one well-tuned trigger and add signals only when the simple approach demonstrably fails
What's Next:
Now that we know what to measure, we need to define how to respond. The next page explores scaling policies—the rules that translate metric observations into scaling actions. We'll cover target tracking, step scaling, simple scaling, and scheduled scaling.
You now understand the full taxonomy of scaling triggers, from infrastructure metrics like CPU and memory to throughput-based, queue-based, latency-based, and custom application metrics. You can analyze a workload and select appropriate triggers that will drive effective, cost-efficient scaling behavior.