Imagine a thermostat that turns the heater on the instant the temperature drops 0.1°F, then turns it off the instant it rises 0.1°F. The result would be chaos—the heater would cycle on and off dozens of times per minute, wasting energy, wearing out components, and never reaching a comfortable temperature. Real thermostats include hysteresis and delay—they wait before reacting and require a meaningful change before acting.
Auto-scaling systems face the exact same challenge. Without careful stabilization mechanisms, they oscillate: scaling out when metrics spike, scaling in when metrics drop from the additional capacity, then scaling out again when the reduced capacity proves insufficient. This cycle repeats endlessly, wasting money, causing instability, and potentially overwhelming downstream systems.
Cool-down periods are the primary mechanism for preventing this oscillation. This page explores cool-down periods in depth: what they are, why they matter, how to configure them, and the advanced stabilization techniques used in production systems.
By the end of this page, you will understand the mechanics of cool-down periods, the problems they solve, how to configure them for different workloads, and advanced stabilization patterns. You'll be able to diagnose and fix oscillating auto-scaling systems and design scaling configurations that are both responsive and stable.
A cool-down period (also called cooldown or stabilization window) is a configurable duration after a scaling activity during which the auto-scaling system ignores metric changes and suppresses further scaling actions. It's a deliberate pause that allows the system to reach a new steady state before making additional adjustments.
The Mechanics: after a scaling activity completes, the autoscaler starts a timer. Until the timer expires, it continues to collect and evaluate metrics but will not initiate another scaling action, even if a threshold is breached. Once the cool-down expires, normal evaluation resumes against the now-stabilized metrics.
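To make this concrete, here is a minimal sketch in Python of a hypothetical cool-down gate (not any platform's actual implementation) that suppresses back-to-back scaling actions:

```python
import time

class CooldownGate:
    """Suppresses further scaling actions for a fixed window after the last one."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self.last_action_time = None  # no scaling action recorded yet

    def can_act(self, now=None):
        """True if no action has been taken, or the cool-down window has elapsed."""
        now = time.monotonic() if now is None else now
        if self.last_action_time is None:
            return True
        return now - self.last_action_time >= self.cooldown_seconds

    def record_action(self, now=None):
        """Start (or restart) the cool-down window after a scaling action."""
        self.last_action_time = time.monotonic() if now is None else now


gate = CooldownGate(cooldown_seconds=300)
if gate.can_act():
    print("scale out +5 instances")      # first request is allowed
    gate.record_action()
if not gate.can_act():
    print("in cool-down: suppressing the follow-up request")
```

Real autoscalers typically keep separate windows for scale-out and scale-in, which is where the asymmetric cool-downs discussed below come in.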
Why Cool-Down Is Necessary:
Without cool-down, the following scenario occurs:
Time 0:00 - CPU at 85%, trigger scale-out, add 5 instances
Time 0:01 - Instances still launching, CPU still 85%, add 5 more
Time 0:02 - Still launching, CPU 85%, add 5 more
...
Time 0:05 - All 15 new instances come online
Time 0:06 - CPU drops to 30% (massively over-provisioned)
Time 0:07 - Trigger scale-in, remove 10 instances
Time 0:08 - CPU rises to 90%, trigger scale-out
→ Oscillation continues indefinitely
Cool-down breaks this cycle by waiting for the system to stabilize before re-evaluating.
Think of cool-down as 'wait and see.' After scaling, you're telling the system: 'I just made a change. Let's see how it plays out before making another change.' This patience is essential for stability.
Different platforms and configurations support various types of cool-down periods. Understanding the distinctions is crucial for proper configuration.
| Type | Scope | When Applied | Typical Values |
|---|---|---|---|
| Default Cooldown | All policies for a scaling group | After any scaling activity from any policy | 300 seconds |
| Policy-Specific Cooldown | Single scaling policy | After scaling from this specific policy | Overrides default |
| Scale-Out Cooldown | Scale-out actions only | After adding capacity | 60-180 seconds (shorter) |
| Scale-In Cooldown | Scale-in actions only | After removing capacity | 300-600 seconds (longer) |
| Instance Warmup | New instances | Exclude new instances from metrics | 120-300 seconds |
Asymmetric Cool-Downs: The Production Pattern
Production systems almost always use asymmetric cool-downs—shorter for scale-out, longer for scale-in:
Scale-Out Cooldown: 60 seconds
Scale-In Cooldown: 300 seconds
Why asymmetric?
Scale-out is urgent — If you're under load, you want to add capacity quickly. Short cooldown allows rapid response to sustained load.
Scale-in can wait — If you've scaled in prematurely and traffic returns, you need to scale back out. Longer cooldown gives time to confirm traffic really is declining.
Cost of error is asymmetric — Scaling out too much costs money (acceptable). Scaling in too much degrades user experience (unacceptable).
Traffic patterns are often bursty — A brief dip in traffic shouldn't trigger immediate scale-in; the dip may be momentary.
Target tracking policies have built-in stabilization that often reduces the need for long cooldowns. They automatically pace scaling actions and account for instance warmup. However, explicit cooldowns still provide additional protection and are recommended even with target tracking.
Oscillation (also called thrashing or flapping) is the phenomenon where an auto-scaling system rapidly cycles between scaling out and scaling in. Understanding the causes helps you prevent and diagnose it.
Anatomy of Oscillation:
| Time | Instances | CPU | Action |
|---|---|---|---|
| 0:00 | 10 | 80% | Scale out (+5) |
| 0:02 | 15 | 45% | Warming up... |
| 0:05 | 15 | 40% | Scale in (-3) |
| 0:07 | 12 | 55% | Settling... |
| 0:10 | 12 | 75% | Scale out (+4) |
| 0:12 | 16 | 42% | Scale in (-4) |
| 0:15 | 12 | 78% | Scale out... |
→ The cycle never ends
Root Causes:
Cooldowns too short — Not enough time for system to stabilize between actions
Thresholds too close — Scale out at 70%, scale in at 65% creates a narrow band where oscillation is likely
Metric volatility — If your metric naturally fluctuates ±20%, and your thresholds span 15%, you'll oscillate
Instance warmup not configured — New instances drag down averages while not actually serving traffic
Step sizes too large — Adding 10 instances when you needed 3 causes over-correction
External factors — Downstream service latency causes variable load, triggering scaling that doesn't help
The Hysteresis Solution:
Hysteresis is the pattern of using different thresholds for scale-out and scale-in, creating a gap that prevents oscillation:
Without Hysteresis:
- Scale out when CPU > 70%
- Scale in when CPU < 70%
→ CPU at 70% causes constant toggling
With Hysteresis:
- Scale out when CPU > 75%
- Scale in when CPU < 50%
→ CPU must drop 25 points before scale-in
→ This gap absorbs normal fluctuations
The width of the hysteresis gap should match your metric's natural volatility. If CPU commonly swings ±15%, your gap should be at least 20%.
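As an illustration, here is a minimal Python sketch of a hysteresis decision rule using the thresholds from the example above (not any platform's actual API):

```python
SCALE_OUT_THRESHOLD = 75.0  # add capacity only above this CPU %
SCALE_IN_THRESHOLD = 50.0   # remove capacity only below this CPU %

def scaling_decision(cpu_percent):
    """Return the action implied by hysteresis thresholds with a 25-point gap."""
    if cpu_percent > SCALE_OUT_THRESHOLD:
        return "scale_out"
    if cpu_percent < SCALE_IN_THRESHOLD:
        return "scale_in"
    return "hold"  # inside the gap: absorb normal fluctuation, do nothing

# CPU wobbling between 55% and 72% never triggers an action:
for sample in (68, 72, 61, 55, 70):
    print(sample, scaling_decision(sample))   # all print "hold"
```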
Beyond compute costs, oscillation causes: instance launch time charges, data transfer for container images, database connection pool exhaustion, cache cold starts on new instances, and increased error rates during transition periods. Preventing oscillation is worth the engineering investment.
Choosing cool-down values isn't guesswork—it should be based on measurable characteristics of your system. Here's a systematic approach:
The Cool-Down Formula:
Minimum Scale-Out Cooldown = Instance Warmup Time + Metric Stabilization Time
Where:
- Instance Warmup Time = Time from launch to passing health check and serving traffic
- Metric Stabilization Time = Time for metrics to reflect new capacity (1-2 metric periods)
Example Calculation:
Instance boot time: 60 seconds
Application startup: 45 seconds
Health check interval: 30 seconds (need to pass 2 checks = 60 seconds)
Load balancer registration: 10 seconds
→ Instance Warmup Time: ~175 seconds, round to 180 seconds (3 minutes)
Metric collection period: 60 seconds
Metric processing delay: 30 seconds
→ Metric Stabilization: 90 seconds
Minimum Scale-Out Cooldown = 180 + 90 = 270 seconds
→ Round to 300 seconds (5 minutes) for safety margin
Scale-In Cooldown:
Scale-In Cooldown = Scale-Out Cooldown × Safety Multiplier
Where:
- Safety Multiplier is typically 2x for moderate traffic variability
- Or 3x for high variability / critical systems
Example: if the scale-out cooldown is 300 seconds, the scale-in cooldown would be 600 seconds (2x) for moderate variability, or 900 seconds (3x) for high-variability or critical systems.
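Since the formula is simple arithmetic, it is easy to script. A small illustrative Python helper (hypothetical function names) that reproduces the worked example:

```python
def scale_out_cooldown(boot_s, app_start_s, health_check_s, lb_register_s,
                       metric_period_s, metric_delay_s, round_to_s=60):
    """Minimum scale-out cooldown = instance warmup time + metric stabilization time."""
    warmup = boot_s + app_start_s + health_check_s + lb_register_s
    stabilization = metric_period_s + metric_delay_s
    total = warmup + stabilization
    # round up to the next multiple of round_to_s as a safety margin
    return -(-total // round_to_s) * round_to_s


def scale_in_cooldown(scale_out_s, safety_multiplier=2):
    """Scale-in cooldown = scale-out cooldown x 2 (moderate variability) or x 3 (high)."""
    return scale_out_s * safety_multiplier


out = scale_out_cooldown(boot_s=60, app_start_s=45, health_check_s=60,
                         lb_register_s=10, metric_period_s=60, metric_delay_s=30)
print(out, scale_in_cooldown(out))  # 300 600 -- matches the worked example above
```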
| System Type | Scale-Out Cooldown | Scale-In Cooldown | Instance Warmup | Notes |
|---|---|---|---|---|
| Containers (K8s) | 30-60 seconds | 180-300 seconds | 30-60 seconds | Fast startup; can be aggressive |
| VM-based services | 180-300 seconds | 600-900 seconds | 120-180 seconds | Slower startup requires patience |
| JVM applications | 300-600 seconds | 900-1200 seconds | 180-300 seconds | JIT warmup takes time |
| ML inference | 600-900 seconds | 1200-1800 seconds | 300-600 seconds | Model loading is slow |
| Serverless (Lambda) | N/A | N/A | Provisioned: 0-60s | Platform handles scaling |
The best cooldown values come from measurement. Track how long your instances actually take from launch to serving traffic at steady-state performance. Use this empirical data, not theoretical values. Set up dashboards showing time-to-first-request and time-to-stable-latency for new instances.
Beyond basic cooldowns, sophisticated auto-scaling systems employ advanced techniques for stability. These are especially important in Kubernetes and enterprise environments.
1. Stabilization Windows (Kubernetes HPA v2)
Kubernetes HPA v2 introduced sophisticated stabilization through the behavior field:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Look back 5 minutes
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Min                 # Most conservative
    scaleUp:
      stabilizationWindowSeconds: 0     # No look-back; scale immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max                 # Most aggressive
How it works: on every sync, the HPA computes a desired replica count. For scale-down, it then acts on the highest recommendation seen during the stabilization window, so replicas shrink only after every recommendation in the past five minutes agrees they should. With scaleUp's window set to 0 there is no look-back, so scale-ups take effect immediately, limited only by the rate policies.
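For intuition, here is a rough Python sketch of the scale-down stabilization idea; the real logic lives inside the Kubernetes HPA controller, so treat this as a simplified model:

```python
from collections import deque
import time

class ScaleDownStabilizer:
    """Keeps recent replica recommendations and returns the most conservative one."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, recommended_replicas)

    def recommend(self, desired_replicas, now=None):
        now = time.monotonic() if now is None else now
        self.history.append((now, desired_replicas))
        # drop recommendations older than the stabilization window
        while self.history and now - self.history[0][0] > self.window_seconds:
            self.history.popleft()
        # scale down only as far as the *highest* recent recommendation allows
        return max(replicas for _, replicas in self.history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(10, now=0))    # 10
print(stabilizer.recommend(4, now=60))    # 10 -- a brief dip does not shrink the deployment
print(stabilizer.recommend(4, now=400))   # 4  -- the old high recommendation has aged out
```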
2. Policy-Based Scaling Limits
Limit how much capacity can change per unit time:
policies:
- type: Pods
  value: 4
  periodSeconds: 60
- type: Percent
  value: 10
  periodSeconds: 60
This allows a change of at most 4 pods or 10% of current replicas per 60 seconds; selectPolicy decides which limit wins (the default, Max, picks whichever permits the larger change), preventing massive instantaneous changes, as sketched below.
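For intuition, a small illustrative Python calculation of the cap these two policies impose under selectPolicy: Max semantics (rounding details are simplified):

```python
import math

def max_pods_added_per_period(current_replicas, pods_limit=4, percent_limit=10):
    """Upper bound on replicas added in one periodSeconds under selectPolicy: Max."""
    by_pods = pods_limit
    by_percent = math.ceil(current_replicas * percent_limit / 100)
    return max(by_pods, by_percent)   # Max picks the more permissive policy

for replicas in (10, 40, 200):
    print(replicas, "->", max_pods_added_per_period(replicas))
# 10  -> 4    (4 pods beats 10% of 10 = 1)
# 40  -> 4    (both limits give 4)
# 200 -> 20   (10% of 200 beats 4 pods)
```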
3. Warm Pools (AWS)
Warm pools keep pre-initialized instances in a stopped or running state, ready for instant activation:
ASG Warm Pool Configuration:
- Pool Size: 10 instances
- Pool State: Stopped (costs only EBS)
- Reuse Policy: Reuse on scale-in
Behavior:
1. Scale-out request comes
2. Instead of launching new instance, activate warm instance
3. Time-to-traffic: seconds instead of minutes
Warm pools reduce effective warmup time, allowing shorter cooldowns while maintaining responsiveness.
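If you configure warm pools programmatically, a hedged boto3 sketch might look like the following (group name and sizes are illustrative; check the current PutWarmPool API reference for the full parameter list):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep 10 pre-initialized instances stopped (paying only for EBS volumes),
# and return scaled-in instances to the pool instead of terminating them.
autoscaling.put_warm_pool(
    AutoScalingGroupName="my-asg",                 # illustrative group name
    MinSize=10,                                    # "Pool Size: 10 instances"
    PoolState="Stopped",                           # "costs only EBS"
    InstanceReusePolicy={"ReuseOnScaleIn": True},  # "Reuse on scale-in"
)
```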
4. Predictive Pre-Scaling
Combine predictive scaling with reactive policies:
7:50 AM: Predictive scaling adds 10 instances (based on historical pattern)
8:00 AM: Traffic arrives; capacity is already sufficient
8:30 AM: Unexpected 20% surge; target tracking adds 5 more
→ Predictive handles the expected; reactive handles the unexpected
This reduces the burden on cooldown configuration because capacity is often pre-positioned.
Every stabilization technique trades responsiveness for stability. The goal isn't maximum stability—it's the right balance for your workload. A real-time trading system needs sub-minute scaling; an internal analytics tool can wait 10 minutes. Match your configuration to your requirements.
When auto-scaling isn't behaving as expected, cooldown configuration is often the culprit. Here's a systematic diagnostic approach:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Scaling too slowly under load | Cooldown too long | Reduce scale-out cooldown; verify instance warmup is accurate |
| Oscillating up and down | Cooldown too short; no hysteresis | Increase cooldowns; add gap between thresholds |
| Scaling out but CPU stays high | Instance warmup not configured | Set warmup time; exclude new instances from metrics |
| Never scaling in | Scale-in cooldown too long; scale-in disabled | Review policy; check for unintentional disable_scale_in flag |
| Scale-in immediately after scale-out | Asymmetric cooldowns not configured | Set scale-in cooldown 2-3x scale-out cooldown |
| Erratic scaling with steady traffic | Metric volatility exceeds threshold gap | Widen thresholds; smooth metrics; longer evaluation periods |
Debugging Checklist:
Check scaling activity history (Kubernetes: kubectl describe hpa <name>; AWS: see the commands below)
Overlay metrics with scaling events
Verify instance readiness timing
Check for cooldown violations
Examine scaling policy configuration

# Get recent scaling activities
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 20 \
  --query 'Activities[*].[StartTime,StatusCode,Description]' \
  --output table

# Check cooldown status
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[*].[DefaultCooldown,DesiredCapacity,Instances[*].HealthStatus]'

During cooldown, you might see activities with status 'WaitingForSpotInstanceRequestId' or 'InProgress' for extended periods. This is normal—instances are launching within the cooldown. The issue is when you see 'Successful' activities in rapid succession, indicating cooldown is too short or being bypassed.
Each platform implements cooldowns differently. Here's platform-specific guidance:
AWS Auto Scaling Group Cooldowns:
Default Cooldown (Group Level):
{
  "AutoScalingGroupName": "my-asg",
  "DefaultCooldown": 300
}
Policy-Level Cooldown (Overrides Default):
{
  "PolicyName": "scale-out-policy",
  "Cooldown": 60
}
Target Tracking Warmup:
{
  "TargetTrackingConfiguration": {
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 50.0
  },
  "EstimatedInstanceWarmup": 180
}
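For completeness, a hedged boto3 sketch of applying the same settings programmatically (names and values illustrative; consult the AWS SDK documentation for the authoritative parameter list):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Group-level default cooldown (applies when a policy has no cooldown of its own)
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",
    DefaultCooldown=300,
)

# Target tracking policy with an explicit instance warmup estimate
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,
    },
    EstimatedInstanceWarmup=180,
)
```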
Key AWS Behaviors:
DefaultCooldown applies only when no policy-specific cooldown is set

We've explored cool-down periods comprehensively. Let's consolidate the key insights:

Cool-down periods pause scaling after each action so the system can reach a new steady state, which prevents oscillation.
Use asymmetric values: short scale-out cooldowns for responsiveness, scale-in cooldowns 2-3x longer for stability.
Derive values from measurement, not guesswork: the minimum scale-out cooldown is roughly instance warmup time plus metric stabilization time.
Pair cooldowns with hysteresis, a wide gap between scale-out and scale-in thresholds that absorbs normal metric fluctuation.
Advanced techniques (stabilization windows, rate-limit policies, warm pools, predictive pre-scaling) trade responsiveness for stability; match the configuration to your workload's requirements.
What's Next:
We've covered reactive and time-based scaling. The final page explores predictive scaling—using machine learning to forecast demand and scale proactively, before load arrives. This represents the cutting edge of auto-scaling technology.
You now understand cooldown periods deeply: what they are, why they matter, how to calculate optimal values, advanced stabilization techniques, and diagnostic approaches. You can configure stable, efficient auto-scaling that avoids oscillation while remaining responsive to genuine load changes.