Knowing what to measure is only half the story. The other half is what to do about it. Scaling policies are the decision-making rules that translate metric observations into concrete capacity changes. They answer questions like: When should capacity change? By how much? And how long should the system wait before acting again?
This page explores the major scaling policy types: target tracking, step scaling, simple scaling, and scheduled scaling. You'll understand when each is appropriate, how to configure them effectively, and the subtle behaviors that can make or break your auto-scaling strategy.
By the end of this page, you will understand the mechanics, trade-offs, and configuration parameters of all major scaling policy types. You'll be able to choose the right policy type for different scenarios and configure policies that respond appropriately to workload changes without oscillation or over-provisioning.
Auto-scaling platforms offer several policy types, each with distinct behavior. Understanding the taxonomy is essential before diving into details.
| Policy Type | How It Works | Best For | Complexity |
|---|---|---|---|
| Target Tracking | Maintains a metric at a specified target value | Steady-state optimization, most workloads | Low |
| Step Scaling | Adds/removes capacity in steps based on alarm severity | Tiered response, complex traffic patterns | Medium |
| Simple Scaling | Adds/removes fixed capacity when alarm triggers | Simple, predictable workloads | Low |
| Scheduled Scaling | Adjusts capacity at predetermined times | Predictable patterns, events, maintenance | Low |
| Predictive Scaling | Uses ML to forecast and scale proactively | Recurring patterns, gradual changes | Low (automated) |
Policy Hierarchy and Interaction:
Most auto-scaling systems allow multiple policies simultaneously. Understanding how they interact is crucial:
- **Scheduled scaling sets the baseline:** If you have a scheduled action to run 20 instances at 9 AM, that becomes the minimum at 9 AM.
- **Dynamic policies adjust from the baseline:** Target tracking or step scaling then adjusts capacity from the scheduled baseline.
- **Largest capacity wins for scale-out:** If one policy wants 15 instances and another wants 20, you get 20.
- **Largest capacity wins for scale-in too:** If one policy wants to drop to 10 instances and another to 5, you get 10 (the less aggressive reduction wins).

Together these rules form a safety envelope: capacity is always at least what the most demanding policy requires.
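A minimal sketch of this resolution rule, assuming a simple max-then-clamp implementation (illustrative, not any platform's actual code):

```python
def resolve_capacity(recommendations: list[int], min_size: int, max_size: int) -> int:
    """Resolve competing policy recommendations: the largest requested
    capacity wins, then the result is clamped to the group's bounds."""
    desired = max(recommendations)                 # most demanding policy prevails
    return max(min_size, min(desired, max_size))   # stay within configured bounds

# One policy wants 15 instances, another wants 20: the larger wins.
print(resolve_capacity([15, 20], min_size=5, max_size=100))  # 20
```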
For most workloads, start with target tracking. It's simpler to configure, automatically handles both scale-out and scale-in, and self-tunes capacity to maintain the target. Use step scaling only when you need tiered responses or target tracking isn't working well.
Target tracking is the most widely recommended policy type. It works like a thermostat: you specify the desired metric value (target), and the system continuously adjusts capacity to maintain it.
How Target Tracking Works:
You specify:

- A metric to track (for example, average CPU utilization or requests per instance)
- A target value to maintain (for example, 50% CPU)

The system continuously:

- Measures the current metric value
- Compares it to the target
- Calculates the capacity change needed to bring the metric back to the target
- Applies the change, subject to cooldowns and the group's min/max bounds
The Control Algorithm:
For metrics that scale proportionally with capacity (like CPU or requests/instance), the formula is:
Desired Capacity = Current Capacity × (Current Metric Value / Target Value)
Example with CPU: Suppose 10 instances are running at 80% average CPU against a target of 50%. Desired capacity = 10 × (80 / 50) = 16.
The system adds 6 instances, expecting average CPU to drop to ~50%.
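A quick sketch of that calculation (rounding up, so the resulting metric lands at or below the target):

```python
import math

def desired_capacity(current_capacity: int, current_metric: float, target: float) -> int:
    """Target tracking's proportional rule: scale capacity by the ratio
    of the observed metric value to the target value."""
    return math.ceil(current_capacity * (current_metric / target))

print(desired_capacity(10, current_metric=80, target=50))  # 16, so add 6 instances
```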
Configuration Parameters:
| Parameter | Description | Typical Values |
|---|---|---|
| Target Value | Desired metric level | 40-70% for CPU; varies by metric |
| Scale-Out Cooldown | Wait after scale-out before next action | 60-180 seconds |
| Scale-In Cooldown | Wait after scale-in before next action | 300-600 seconds |
| Disable Scale-In | Prevent policy from removing instances | false (rarely true) |
| Instance Warmup | Time for new instances to contribute | 60-300 seconds |
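To make the parameters concrete, here is how they map onto an AWS target tracking policy via boto3. This is a sketch; the group name web-fleet is a hypothetical placeholder.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Attach a target tracking policy to a hypothetical Auto Scaling group.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-fleet",      # hypothetical group name
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=180,           # boot + health check + steady state
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,               # desired average CPU
        "DisableScaleIn": False,           # let the policy remove capacity too
    },
)
```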
Critical: Instance Warmup
When a new instance launches, it takes time to become healthy and handle traffic. During warmup:

- The instance's metrics are excluded from the group's aggregate, so a booting instance's idle CPU doesn't drag the average down and trigger premature scale-in
- The instance counts as capacity already added, so the policy doesn't keep launching more instances while earlier ones are still starting up
Set warmup to approximately: time_to_boot + time_to_pass_healthcheck + time_to_reach_steady_state
The target value profoundly affects behavior. Too high (e.g., 90% CPU) leaves no headroom for traffic spikes—you'll scale too late. Too low (e.g., 30% CPU) wastes money on over-provisioning. The right target depends on your traffic's variance and your latency sensitivity. Measure and tune.
Step scaling provides graduated responses based on the severity of the metric deviation. Instead of a single target, you define multiple steps—different actions for different metric ranges.
How Step Scaling Works: You define a base alarm threshold plus a set of step adjustments. When the alarm breaches, the size of the breach determines which step applies: small deviations trigger small adjustments, large deviations trigger large ones. Unlike simple scaling, step scaling can keep responding to new alarm data even while a scaling activity is in progress.
Example Configuration:
```
Metric: Average CPU Utilization
Base Threshold: 60%

Scale-Out Steps:
  +10% (60-70%):  Add 1 instance
  +20% (70-80%):  Add 2 instances
  +30% (80-90%):  Add 3 instances
  +40% (90%+):    Add 5 instances

Scale-In Steps:
  -10% (50-60%):  Remove 1 instance
  -20% (40-50%):  Remove 2 instances
  -30% (<40%):    Remove 3 instances
```
With this configuration: CPU at 75% (a 15-point breach) adds 2 instances, while CPU at 92% adds 5. The response scales with the severity of the deviation: mild pressure gets a mild correction, severe pressure gets an aggressive one.
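As a sketch, the step-selection logic reduces to a threshold lookup (illustrative; real platforms evaluate alarm breach ranges, but the mapping is the same):

```python
def step_adjustment(cpu: float) -> int:
    """Map a CPU reading to a capacity change using the steps above:
    positive = instances to add, negative = instances to remove."""
    scale_out = [(90, 5), (80, 3), (70, 2), (60, 1)]  # (lower bound, add count)
    scale_in = [(40, -3), (50, -2), (60, -1)]         # (upper bound, remove count)
    for bound, change in scale_out:
        if cpu >= bound:
            return change
    for bound, change in scale_in:
        if cpu < bound:
            return change
    return 0

print(step_adjustment(75))  # 2  (70-80% band)
print(step_adjustment(92))  # 5  (90%+ band)
print(step_adjustment(35))  # -3 (<40% band)
```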
Step Adjustment Types:
Step scaling supports different adjustment types:
| Adjustment Type | Behavior | Example |
|---|---|---|
| ChangeInCapacity | Add/remove fixed count | Add 2 instances |
| ExactCapacity | Set to specific count | Set to 10 instances |
| PercentChangeInCapacity | Add/remove percentage | Add 20% more instances |
PercentChangeInCapacity is particularly useful for large fleets where fixed changes would be too small (adding 1 instance to a 100-instance fleet is only a 1% increase).
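A sketch of the percentage calculation with a floor on the step size (AWS exposes this floor as the MinAdjustmentMagnitude parameter):

```python
import math

def percent_scale_out(current: int, percent: float, min_magnitude: int = 1) -> int:
    """Percentage-based capacity increase with a minimum step size."""
    change = math.ceil(current * percent / 100)
    return max(change, min_magnitude)

print(percent_scale_out(100, 20))                  # 20: meaningful on a large fleet
print(percent_scale_out(10, 5, min_magnitude=2))   # 2: the floor kicks in
```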
A common step scaling pattern uses different thresholds for scale-out and scale-in. Scale out at 70% CPU, but only scale in when CPU drops to 40%. This gap (hysteresis) prevents rapid oscillation when the metric hovers near a threshold.
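In code, the dead band is just a pair of asymmetric thresholds (a minimal sketch):

```python
def decide(cpu: float, scale_out_at: float = 70, scale_in_at: float = 40) -> str:
    """The 40-70% dead band absorbs normal fluctuation so the
    group doesn't flap around a single threshold."""
    if cpu > scale_out_at:
        return "scale-out"
    if cpu < scale_in_at:
        return "scale-in"
    return "hold"

print(decide(75))  # scale-out
print(decide(55))  # hold: inside the dead band
print(decide(35))  # scale-in
```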
Simple scaling is the oldest and most basic policy type. It triggers a single action when an alarm transitions to ALARM state, then waits for a cooldown period before any further scaling.
How Simple Scaling Works: A single alarm maps to a single fixed action. When the alarm enters the ALARM state, the action fires once; the policy then ignores all further signals until the cooldown expires, regardless of what the metric does in the meantime.
Example:
```
Alarm:    CPU > 70% for 5 minutes
Action:   Add 2 instances
Cooldown: 300 seconds
```
When CPU exceeds 70% for 5 minutes: the group adds 2 instances, then blocks any further scaling for 300 seconds. If 2 instances aren't enough, you wait out the full cooldown before the policy can act again, even as CPU keeps climbing.
When to Use Simple Scaling (Rare Cases):

- Legacy configurations that already depend on it and aren't yet worth migrating
- Workloads so stable and predictable that a single fixed adjustment is always the right response
- Situations where you explicitly want the cooldown to block all further activity while a change takes effect
Migration Path:
If you have simple scaling policies, consider migrating:

- To target tracking for most workloads, setting the target near the alarm threshold you use today
- To step scaling where you genuinely need different responses at different severity levels
AWS explicitly recommends target tracking or step scaling over simple scaling. The AWS documentation states: 'We recommend that you use target tracking scaling policies instead of simple scaling policies for most use cases.' Simple scaling exists primarily for backward compatibility.
Scheduled scaling adjusts capacity based on time, not metrics. It's used when you can predict traffic patterns in advance—daily cycles, weekly patterns, known events, or maintenance windows.
How Scheduled Scaling Works:
1. Define scheduled actions specifying a schedule (a one-time timestamp or a recurring cron expression) and the capacity settings to apply (min, desired, and/or max)
2. At the scheduled time, the system adjusts the capacity bounds
3. Dynamic policies (target tracking, step scaling) then work within the new bounds
Example Schedule:
```
# Every weekday at 8 AM: Prepare for business hours
Schedule: 0 8 ? * MON-FRI *
Action:   Set min=20, desired=30, max=100

# Every weekday at 6 PM: Reduce for evening
Schedule: 0 18 ? * MON-FRI *
Action:   Set min=5, desired=10, max=50

# Weekends: Minimal capacity
Schedule: 0 0 ? * SAT-SUN *
Action:   Set min=2, desired=5, max=20
```
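Translated into boto3, the first action might look like this (a sketch; web-fleet is a hypothetical group, and note that the API's Recurrence field takes standard five-field Unix cron rather than the six-field syntax above):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Weekday 8 AM ramp-up, evaluated in the given time zone.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-fleet",         # hypothetical group name
    ScheduledActionName="business-hours-rampup",
    Recurrence="0 8 * * 1-5",                 # minute hour day month weekday
    MinSize=20,
    DesiredCapacity=30,
    MaxSize=100,
    TimeZone="UTC",
)
```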
Pre-Scaling Strategy:
Scheduled scaling enables proactive pre-scaling—adding capacity before load arrives rather than reacting to it:
Without Pre-Scaling:

```
8:00 AM  Traffic surge begins
8:01 AM  CPU spikes, triggering scale-out
8:05 AM  New instances launching
8:10 AM  Instances passing health checks
8:12 AM  Capacity catches up
→ 12 minutes of degraded performance
```

With Pre-Scaling:

```
7:45 AM  Scheduled action adds capacity
7:50 AM  New instances ready
8:00 AM  Traffic surge begins
8:00 AM  Capacity already sufficient
→ Zero impact on users
```
Combining Scheduled and Dynamic Scaling:
The power of scheduled scaling multiplies when combined with dynamic policies: the schedule sets a sensible baseline for each period, and target tracking or step scaling adjusts around that baseline when actual traffic deviates from the plan.
Scheduled actions must account for time zones. If your traffic is driven by users in multiple regions, you may need different schedules for different regions' fleets. Also consider daylight saving time—some cron implementations adjust automatically, others don't. Use UTC to avoid confusion.
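A quick illustration of the pitfall using Python's zoneinfo (dates chosen arbitrarily on either side of a DST boundary):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# "8 AM in New York" maps to different UTC hours across the DST change.
for day in (datetime(2024, 1, 15, 8), datetime(2024, 7, 15, 8)):
    local = day.replace(tzinfo=ZoneInfo("America/New_York"))
    print(local.isoformat(), "->", local.astimezone(ZoneInfo("UTC")).isoformat())
# January: 08:00 EST -> 13:00 UTC
# July:    08:00 EDT -> 12:00 UTC
```

A schedule pinned to UTC therefore fires an hour earlier or later in local terms after a DST transition; decide which behavior matches your traffic before choosing.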
Predictive scaling uses machine learning to analyze historical traffic patterns and forecast future demand, automatically scaling before load arrives. It combines the proactive benefits of scheduled scaling with automatic pattern detection.
How Predictive Scaling Works:

1. The service analyzes historical metric data to detect recurring patterns (daily and weekly cycles)
2. It generates a forecast of future load
3. It schedules capacity changes ahead of the forecasted demand, launching instances early enough (the scheduling buffer) to be ready when load arrives
4. Forecasts are refreshed regularly as new data accumulates
AWS Predictive Scaling Example:
```json
{
  "PredictiveScalingConfiguration": {
    "MetricSpecifications": [{
      "TargetValue": 70,
      "PredefinedMetricPairSpecification": {
        "PredefinedMetricType": "ASGCPUUtilization"
      }
    }],
    "Mode": "ForecastAndScale",
    "SchedulingBufferTime": 300
  }
}
```
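Applying that same configuration through boto3 might look like this (a sketch; the group name is a hypothetical placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Attach the predictive scaling policy shown above.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-fleet",          # hypothetical group name
    PolicyName="predictive-cpu-70",
    PolicyType="PredictiveScaling",
    PredictiveScalingConfiguration={
        "MetricSpecifications": [{
            "TargetValue": 70,
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization",
            },
        }],
        "Mode": "ForecastAndScale",
        "SchedulingBufferTime": 300,           # launch instances 5 minutes early
    },
)
```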
Modes:

- ForecastOnly: Generates forecasts without acting on them, so you can validate predictions against actual traffic before trusting them
- ForecastAndScale: Generates forecasts and scales capacity to match them
Requirements and Limitations:
| Requirement | Why It Matters |
|---|---|
| 14+ days of history | ML needs sufficient data to detect patterns |
| Recurring patterns | Random/unpredictable traffic won't benefit |
| Steady metric relationship | If capacity→metric relationship is unstable, predictions fail |
| Minimum scale | Works best for groups whose capacity regularly varies over a meaningful range |
When NOT to Use Predictive Scaling:

- New workloads without enough history for the model to learn from
- Spiky or random traffic with no recurring pattern
- One-off events such as product launches or flash sales (use scheduled scaling instead)
- Workloads where the capacity-to-metric relationship shifts frequently
Best practice combines predictive scaling (for recurring patterns) with target tracking (for real-time adjustment). Predictive scaling ensures you're never caught off-guard by predictable patterns. Target tracking handles the unexpected—traffic spikes from viral content, attack patterns, or unexpected adoption.
Proper policy configuration is as important as choosing the right policy type. These best practices apply across policy types and platforms:
The Scaling Budget Pattern:
For cost control, implement a scaling budget—limits on how much capacity can change in a given period:
```
# Example: Max 10 instances per 10-minute window
Max Scale-Out Rate: +10 instances per 600 seconds
Max Scale-In Rate:  -5 instances per 600 seconds
```

Enforced via:

- Step scaling step limits
- Instance lifecycle hooks that throttle
- Custom Lambda-based governors
This prevents scenarios where runaway metrics trigger massive scale-outs that overwhelm downstream systems or blow through budgets.
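A sketch of a custom governor enforcing such a budget (illustrative; a real one might run in Lambda and clamp the group's desired capacity):

```python
import time
from collections import deque

class ScalingGovernor:
    """Cap net capacity additions within a sliding time window."""

    def __init__(self, max_added: int = 10, window_seconds: int = 600):
        self.max_added = max_added
        self.window = window_seconds
        self.events = deque()  # (timestamp, instances granted)

    def allow(self, requested: int) -> int:
        """Return how many of the requested instances may be added now."""
        now = time.monotonic()
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()  # expire events outside the window
        used = sum(count for _, count in self.events)
        granted = max(0, min(requested, self.max_added - used))
        if granted:
            self.events.append((now, granted))
        return granted

governor = ScalingGovernor(max_added=10, window_seconds=600)
print(governor.allow(8))  # 8: within budget
print(governor.allow(5))  # 2: only 2 left in this 10-minute window
```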
The Minimum Viable Capacity Pattern:
Always define your minimum capacity based on:

- Redundancy: at least 2 instances, spread across availability zones, so a single failure doesn't take you to zero
- Baseline load: enough capacity to serve your quietest period without degradation
- Recovery headroom: enough slack to absorb a sudden spike while new instances boot

The table below summarizes production-ready starting values for the key parameters:
| Parameter | Scale-Out Value | Scale-In Value | Notes |
|---|---|---|---|
| Cooldown Period | 60-120 seconds | 300-600 seconds | Asymmetric: fast out, slow in |
| Evaluation Periods | 1-2 | 5-10 | More samples for scale-in decision |
| Target (CPU) | 50-70% | Same | Leave headroom for spikes |
| Target (Requests) | 70-80% of tested max | Same | Based on load test results |
| Instance Warmup | 180-300 seconds | N/A | Time to fully contribute |
| Minimum Capacity | 2+ | N/A | At least 2 for redundancy |
Beware the scenario where you hit maximum capacity. If max is 100 and you need 150, you're at 100 with degraded performance. Either your max is too low, or you have a capacity planning problem. Set alerts at 80% of max to investigate before hitting the ceiling.
We've covered the complete landscape of scaling policies. Let's consolidate the key insights:

- Target tracking is the default choice: simple to configure, self-tuning, and it handles both scale-out and scale-in
- Step scaling adds tiered responses when the severity of a breach should dictate the size of the action
- Simple scaling is legacy; migrate to target tracking or step scaling
- Scheduled scaling handles predictable patterns proactively; predictive scaling automates the same idea for recurring patterns
- Policies combine: scheduled or predictive actions set the baseline, dynamic policies handle deviations, and the largest-capacity recommendation wins
What's Next:
Policies define how to respond, but they don't address what happens immediately after scaling. The next page explores cool-down periods—the critical mechanism that prevents scaling oscillation and ensures stable, efficient scaling behavior.
You now understand the mechanics, configuration, and trade-offs of all major scaling policy types. You can select appropriate policy types for different workloads, configure them with production-ready parameters, and combine multiple policies for comprehensive coverage.