A B2B SaaS company serving enterprise customers noticed a peculiar pattern: their infrastructure ran at near-capacity during business hours (9 AM - 6 PM EST) but dropped to 15% utilization during nights and weekends. Yet they were paying full price 24/7—168 hours per week for a workload that only needed full capacity for 45 hours.
After implementing aggressive auto-scaling policies, they reduced their instance count from a fixed 100 to a dynamic 20-100 range. The result: a 45% reduction in compute costs without any change to user experience. Their infrastructure now breathes with their business.
Auto-scaling is typically framed as an availability tool—scaling up to handle load spikes. But it's equally powerful as a cost optimization strategy—scaling down when demand decreases. The cloud's promise is "pay for what you use," but that only works if you're not running idle capacity.
This page reframes auto-scaling through a cost lens: how to design scaling policies that minimize waste, implement scale-to-zero patterns, and build systems that automatically match capacity to demand.
By the end of this page, you will understand how to design auto-scaling for cost efficiency: configuring scale-down policies, implementing scheduled scaling, building scale-to-zero architectures, and measuring the cost impact of your scaling strategies.
To understand why auto-scaling is a cost strategy, we need to understand the difference between peak capacity and average demand. Most workloads exhibit significant variability:
Temporal patterns:
- Daily: traffic concentrated in business hours, quiet overnight
- Weekly: busy weekdays, light weekends
- Seasonal: end-of-month, end-of-quarter, or holiday peaks
The fixed-capacity problem:
Without auto-scaling, you must provision for peak capacity:
| Time Period | Demand | Fixed Capacity | Utilization | Waste |
|---|---|---|---|---|
| Peak (4 hrs/day) | 100 units | 100 units | 100% | 0% |
| Normal (8 hrs/day) | 50 units | 100 units | 50% | 50% |
| Off-peak (12 hrs/day) | 20 units | 100 units | 20% | 80% |
| Weighted Average | ~43 units | 100 units | 43% | 57% |
You're paying for 100 units of capacity but using only about 43 on average—57% waste.
The dynamic-capacity opportunity:
With effective auto-scaling:
| Time Period | Demand | Dynamic Capacity | Utilization | Waste |
|---|---|---|---|---|
| Peak (4 hrs/day) | 100 units | 120 units (+20% buffer) | 83% | 17% |
| Normal (8 hrs/day) | 50 units | 60 units | 83% | 17% |
| Off-peak (12 hrs/day) | 20 units | 25 units | 80% | 20% |
| Weighted Average | ~43 units | ~53 units | ~82% | ~18% |
You're now paying for roughly 53 units on average instead of 100—close to a 50% cost reduction while maintaining comfortable headroom. (The worked example below rounds the scaled average up to 55 instances.)
The cost equation:
Auto-scaling savings = (Fixed capacity - Average scaled capacity) × Cost per unit × Hours
Example:
Fixed: 100 instances × $0.10/hr × 730 hrs/month = $7,300/month
Scaled: 55 instances avg × $0.10/hr × 730 hrs/month = $4,015/month
Savings: $3,285/month (45%)
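The arithmetic above can be checked with a short script (rates and hours taken directly from the example):

```python
HOURS_PER_MONTH = 730
RATE = 0.10  # $/instance-hour, from the example


def monthly_cost(instances: float, rate: float = RATE, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly compute cost for a given (average) instance count."""
    return instances * rate * hours


fixed = monthly_cost(100)   # fixed fleet provisioned for peak
scaled = monthly_cost(55)   # average fleet size under auto-scaling
savings = fixed - scaled
print(f"Fixed: ${fixed:,.0f}  Scaled: ${scaled:,.0f}  Savings: ${savings:,.0f} ({savings / fixed:.0%})")
# Fixed: $7,300  Scaled: $4,015  Savings: $3,285 (45%)
```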
Traditional scaling targets 70-80% utilization to maintain headroom for spikes. For cost optimization, consider separate targets: scale out at 70% (protect availability), scale in at 40% (reclaim waste). Asymmetric thresholds prevent oscillation while maintaining cost efficiency.
Auto-scaling policies define when and how much to scale. While scale-out policies get the most attention (protecting against overload), scale-in policies drive cost savings.
Types of scaling policies:
1. Target Tracking Scaling
Maintains a target value for a metric (e.g., 70% CPU). The simplest approach, but can be suboptimal for cost:
# AWS Auto Scaling Policy (target tracking)
TargetTrackingScalingPolicyConfiguration:
  TargetValue: 70.0                                # Target 70% CPU
  PredefinedMetricSpecification:
    PredefinedMetricType: ASGAverageCPUUtilization
  ScaleOutCooldown: 60                             # Seconds before next scale-out
  ScaleInCooldown: 300                             # Seconds before next scale-in
Cost optimization tip: Use a longer ScaleInCooldown to prevent thrashing, but don't make it so long that you maintain excess capacity for extended periods.
2. Step Scaling
Defines different scaling adjustments based on metric thresholds. Allows aggressive scale-out but conservative scale-in:
# Scale-out policy (alarm threshold: 70% CPU)
StepAdjustments:
  # Scale out aggressively
  - MetricIntervalLowerBound: 0      # CPU 70-85%
    MetricIntervalUpperBound: 15
    ScalingAdjustment: 1             # Add 1 instance
  - MetricIntervalLowerBound: 15     # CPU 85-100%
    ScalingAdjustment: 3             # Add 3 instances (faster)

# Scale-in policy (separate low-CPU alarm at 50%)
StepAdjustments:
  # Scale in conservatively
  - MetricIntervalLowerBound: -20    # CPU 30-50%
    MetricIntervalUpperBound: 0
    ScalingAdjustment: -1            # Remove 1 instance
  - MetricIntervalUpperBound: -20    # CPU < 30%
    ScalingAdjustment: -2            # Remove 2 instances
3. Predictive Scaling
Uses machine learning to forecast demand and pre-scale capacity. Available in AWS and increasingly in other platforms:
# AWS Predictive Scaling
PredictiveScalingConfiguration:
  MetricSpecifications:
    - TargetValue: 70.0
      PredefinedMetricPairSpecification:
        PredefinedMetricType: ASGCPUUtilization
  Mode: ForecastAndScale                           # Both predict and act
  SchedulingBufferTime: 300                        # Scale 5 min before predicted spike
  MaxCapacityBreachBehavior: IncreaseMaxCapacity
Predictive scaling excels when:
- Demand follows regular, recurring patterns (daily or weekly cycles)
- Instances take minutes to warm up, so reactive scaling lags behind ramp-ups
- Load ramps are steep enough that reacting after the fact degrades service
Cost benefit: Predictive scaling often maintains lower average capacity than reactive scaling because it doesn't over-provision in response to spikes.
Without proper tuning, auto-scaling can oscillate—scaling up, then immediately scaling down, then up again. This wastes money (new instances cost time and initial resources) and creates instability. Use cooldown periods, asymmetric thresholds, and careful metric selection to prevent oscillation.
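The oscillation problem can be illustrated with a toy model (each instance serves one unit of load at 100% CPU; thresholds and load values are hypothetical):

```python
def next_capacity(load: float, capacity: int, out_at: float, in_at: float) -> int:
    """One evaluation of a simple scaling loop: each instance serves 1 unit at 100% CPU."""
    util = load / capacity * 100
    if util >= out_at:
        return capacity + 1
    if util <= in_at:
        return max(1, capacity - 1)
    return capacity


def simulate(load: float, capacity: int, out_at: float, in_at: float, steps: int = 20) -> list:
    """Run the scaling loop for a fixed number of evaluations and record fleet size."""
    trace = [capacity]
    for _ in range(steps):
        capacity = next_capacity(load, capacity, out_at, in_at)
        trace.append(capacity)
    return trace


# Narrow dead band (scale in at 65%): the fleet thrashes between 10 and 11 forever
print(simulate(7.0, 10, out_at=70, in_at=65)[:6])  # [10, 11, 10, 11, 10, 11]
# Asymmetric thresholds (scale in at 40%): the fleet settles at 11 and stays there
print(simulate(7.0, 10, out_at=70, in_at=40)[:6])  # [10, 11, 11, 11, 11, 11]
```

With the wide dead band, a steady load of 7 units lands at ~64% utilization on 11 instances, inside the band, so no further scaling events fire.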
When you know when demand will change, scheduled scaling is more efficient than reactive scaling. It provides proactive capacity management with guaranteed scale-down.
When to use scheduled scaling:
- Demand follows a known calendar (business hours, nightly batch windows, weekends)
- Development, staging, and QA environments used only during working hours
- Recurring events with predictable timing (daily reports, scheduled imports)
# Scheduled Scaling for Business Hours Application
# Scale up for business hours, down for nights/weekends

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Business hours start: 7 AM ET (12:00 UTC)
  recurrence       = "0 12 * * 1-5"  # Mon-Fri at 7 AM ET
  min_size         = 10
  max_size         = 50
  desired_capacity = 20
}

resource "aws_autoscaling_schedule" "scale_down_evening" {
  scheduled_action_name  = "scale-down-evening"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Business hours end: 8 PM ET (01:00 UTC next day)
  recurrence       = "0 1 * * 2-6"  # Tue-Sat in UTC = Mon-Fri at 8 PM ET
  min_size         = 2
  max_size         = 20
  desired_capacity = 5
}

resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Weekend: minimal capacity
  recurrence       = "0 1 * * 0,6"  # Saturday and Sunday nights
  min_size         = 1
  max_size         = 10
  desired_capacity = 2
}

# Development environment: scale to zero at night
resource "aws_autoscaling_schedule" "dev_off" {
  count                  = var.environment == "development" ? 1 : 0
  scheduled_action_name  = "dev-scale-to-zero"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence       = "0 2 * * *"  # Every night at 9 PM ET
  min_size         = 0
  max_size         = 0
  desired_capacity = 0
}

resource "aws_autoscaling_schedule" "dev_on" {
  count                  = var.environment == "development" ? 1 : 0
  scheduled_action_name  = "dev-scale-up"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence       = "0 13 * * 1-5"  # Mon-Fri at 8 AM ET
  min_size         = 1
  max_size         = 5
  desired_capacity = 2
}

Combining scheduled and dynamic scaling:
The most effective approach combines scheduled scaling (predictable patterns) with dynamic scaling (unexpected fluctuations):
Scheduled scaling: Sets baseline capacity for time of day
├── 7 AM: min=10, max=50, desired=20
├── 8 PM: min=2, max=20, desired=5
└── Weekend: min=1, max=10, desired=2
Dynamic scaling: Adjusts within scheduled bounds
├── If CPU > 70%: scale out (up to max)
└── If CPU < 40%: scale in (down to min)
Scheduled scaling sets the permitted range; dynamic scaling adjusts within that range based on actual demand.
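The interaction reduces to a clamp: dynamic scaling proposes a capacity, and the scheduled window bounds it. A minimal sketch (window values taken from the diagram above):

```python
def desired_within_schedule(dynamic_desired: int, sched_min: int, sched_max: int) -> int:
    """Dynamic scaling proposes a capacity; the active scheduled window clamps it."""
    return max(sched_min, min(sched_max, dynamic_desired))


# Business-hours window (min=10, max=50): a quiet spell can't drop below 10
print(desired_within_schedule(4, 10, 50))   # 10
# Overnight window (min=2, max=20): an unexpected spike is capped at 20
print(desired_within_schedule(35, 2, 20))   # 20
# Within bounds, the dynamic value wins unchanged
print(desired_within_schedule(15, 10, 50))  # 15
```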
Savings calculation:
Scenario: 50-instance fixed capacity
With scheduled scaling:
Business hours (45 hrs/week): avg 35 instances
Off-hours (123 hrs/week): avg 10 instances
Weighted average: (45×35 + 123×10) / 168 ≈ 16.7 instances avg
Savings: (50 - 16.7) / 50 ≈ 67% reduction
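The weighted-average arithmetic can be verified in a few lines:

```python
def weighted_avg_instances(segments: list) -> float:
    """segments: (hours_per_week, avg_instances) pairs covering all 168 hours."""
    total_hours = sum(h for h, _ in segments)
    return sum(h * n for h, n in segments) / total_hours


# Business hours: 45 hrs/week at 35 instances; off-hours: 123 hrs/week at 10
avg = weighted_avg_instances([(45, 35), (123, 10)])
print(f"avg ≈ {avg:.1f} instances, savings ≈ {(50 - avg) / 50:.0%}")
# avg ≈ 16.7 instances, savings ≈ 67%
```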
Development, staging, and QA environments often run 24/7 despite being used only during business hours. Implement aggressive scheduled scaling or complete shutdown for non-production. A development environment that scales to zero from 7 PM to 8 AM saves 54% of compute costs with zero impact.
The ultimate cost optimization is scale-to-zero: running zero capacity when there's zero demand. This is the promise of serverless, but can be achieved with traditional compute as well.
Native scale-to-zero services:
| Service | Scale-to-Zero | Cold Start | Cost When Idle |
|---|---|---|---|
| Lambda | Native | 100ms - 10s | $0 |
| Fargate (ECS/EKS) | Native (with config) | 30s - 2min | $0 |
| Aurora Serverless v2 | To 0.5 ACU minimum | N/A (always warm) | ~$40/month minimum |
| DynamoDB On-Demand | Native | None | $0 (storage only) |
| API Gateway | Native | None | $0 (no requests) |
| App Runner | Native | Seconds | ~$5/month minimum |
| EKS with Karpenter | To 0 nodes | 30s - 2min | $0 for nodes |
Implementing scale-to-zero with containers:
Traditional container orchestration (ECS, Kubernetes) doesn't scale to zero by default because there's no mechanism to wake up instances when requests arrive. However, several patterns enable this:
# KEDA ScaledObject: Scale to Zero Based on SQS Queue Depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor       # Deployment to scale
  pollingInterval: 15           # Check queue every 15s
  cooldownPeriod: 300           # Wait 5 min before scaling to zero
  minReplicaCount: 0            # Enable scale to zero!
  maxReplicaCount: 50           # Maximum scale
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "10"       # Target: 10 messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: aws-credentials
---
# Alternative: HTTP-based scale to zero with Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_total
        threshold: "100"        # 100 requests per replica
        query: |
          sum(rate(http_requests_total{service="api"}[2m]))
---
# Knative Service with scale-to-zero
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: order-api
  namespace: production
spec:
  template:
    metadata:
      annotations:
        # Scale to zero after 5 minutes of inactivity
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "5m"
        # Target 70% utilization
        autoscaling.knative.dev/target-utilization-percentage: "70"
        # Max scale
        autoscaling.knative.dev/max-scale: "50"
    spec:
      containers:
        - image: myregistry/order-api:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"

Cold start considerations:
Scale-to-zero introduces cold starts: the delay when scaling from zero to handle the first request. Managing cold starts is essential for user-facing applications:
| Strategy | Description | Use Case |
|---|---|---|
| Accept latency | First request waits for scale-up | Background processing, async APIs |
| Minimum of 1 | Keep at least 1 instance warm | Low-traffic APIs needing responsiveness |
| Provisioned concurrency | Pre-warm instances (Lambda) | Latency-sensitive Lambda functions |
| Request queuing | Queue requests during scale-up | Knative, KEDA with queue triggers |
| Hybrid | Scale to 1 during business hours, 0 overnight | Business-hours applications |
Scale-to-zero saves money but increases first-request latency. For user-facing applications, this trade-off may be unacceptable. Consider: keeping minimum=1 during business hours (still saves nights/weekends), accepting cold starts for internal tools, or using scale-to-zero only for async/batch workloads.
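The hybrid strategy reduces to a schedule-aware floor on replicas. A minimal sketch (business-hours window and weekday rules are hypothetical, chosen for illustration):

```python
from datetime import datetime


def min_replicas(now: datetime, business_start: int = 8, business_end: int = 19) -> int:
    """Hybrid cold-start policy: keep one warm replica during weekday business
    hours, allow scale-to-zero overnight and on weekends."""
    if now.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return 0
    return 1 if business_start <= now.hour < business_end else 0


print(min_replicas(datetime(2024, 6, 3, 10)))  # Monday 10 AM -> 1 (stay warm)
print(min_replicas(datetime(2024, 6, 3, 23)))  # Monday 11 PM -> 0 (scale to zero)
print(min_replicas(datetime(2024, 6, 8, 12)))  # Saturday noon -> 0 (scale to zero)
```

A scheduled job could feed this value into the autoscaler's minimum (e.g., KEDA's minReplicaCount) so the first business-hours request never hits a cold start.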
Kubernetes presents unique scaling challenges and opportunities. The two-level abstraction (pods and nodes) requires coordinated scaling for cost efficiency.
Pod scaling (HPA - Horizontal Pod Autoscaler):
HPA scales pods based on resource utilization or custom metrics. For cost efficiency, configure scale-up to react quickly and scale-down to reclaim idle capacity steadily:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # Target 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80      # Target 80% memory
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # React quickly to load
      policies:
        - type: Percent
          value: 100                  # Can double capacity
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scale-down
      policies:
        - type: Percent
          value: 25                   # Remove at most 25% at a time
          periodSeconds: 60
Node scaling with Karpenter:
Karpenter provides more cost-efficient node scaling than Cluster Autoscaler:
- Provisions right-sized nodes directly from pending pod requirements, rather than scaling fixed node groups
- Bin-packs workloads and actively consolidates underutilized nodes
- Mixes spot and on-demand capacity across diverse instance types natively
Karpenter consolidation for cost:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m6i.large
            - m6i.xlarge
            - m6i.2xlarge
            - m6a.large
            - m6a.xlarge
            - c6i.xlarge
            - c6i.2xlarge
  disruption:
    # Key for cost optimization:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 1m
    # Alternative: WhenEmpty only removes empty nodes
    # consolidationPolicy: WhenEmpty
  # Cost limits
  limits:
    cpu: 500
    memory: 1000Gi
Cluster Autoscaler scales node groups (ASGs) reactively based on pending pods. It's simpler but less cost-efficient. Karpenter provisions individual nodes with optimal sizing and actively consolidates. For cost optimization, Karpenter is significantly better—often 30-50% more efficient than Cluster Autoscaler.
Databases and stateful services present the greatest challenge for dynamic scaling. They can't simply be terminated and restarted like stateless compute. However, modern cloud databases offer scaling options:
Aurora Serverless v2:
AWS Aurora Serverless automatically scales database capacity based on load:
Capacity range: 0.5 ACU to 128 ACU
(1 ACU ≈ 2 GB RAM)
Scaling:
- Scales in seconds, not minutes
- Scales based on CPU, connections, memory
- Scales down when load decreases
- No downtime during scaling
Cost model:
- Pay per ACU-hour consumed
- Minimum cost: ~$43/month (0.5 ACU × 730 hours)
- vs. db.t3.medium: ~$50/month fixed
For variable workloads, Aurora Serverless can reduce costs by 40-60% compared to provisioned Aurora.
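A rough break-even sketch makes the comparison concrete. It assumes ~$0.12 per ACU-hour (consistent with the ~$43/month floor above) and, for simplicity, prices provisioned capacity at the same per-ACU rate; real provisioned pricing differs by instance class:

```python
ACU_RATE = 0.12  # assumed $/ACU-hour; the ~$43/month floor above implies ≈ $0.12
HOURS = 730


def serverless_cost(avg_acu: float, hours: int = HOURS) -> float:
    """Serverless v2 bills for consumed ACU-hours, with a 0.5 ACU floor."""
    return max(avg_acu, 0.5) * ACU_RATE * hours


def provisioned_cost(peak_acu: float, hours: int = HOURS) -> float:
    """Provisioned capacity must be sized for peak and billed whether used or not."""
    return peak_acu * ACU_RATE * hours


# Variable workload: peaks at 8 ACU during business hours, averages 3 ACU
print(f"provisioned: ${provisioned_cost(8):.0f}  serverless: ${serverless_cost(3):.0f}")
# provisioned: $701  serverless: $263
```

Under these assumptions the serverless bill is about 62% lower, in line with the 40-60% range quoted above; the gap shrinks as average load approaches peak.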
When Aurora Serverless v2 excels:
- Variable or spiky workloads with long idle periods
- Development and test databases used intermittently
- New applications with unknown load patterns
- Multi-tenant databases with uneven tenant activity
| Service | Scaling Type | Scale Range | Scaling Speed | Cost Impact |
|---|---|---|---|---|
| Aurora Serverless v2 | Automatic capacity | 0.5-128 ACU | Seconds | Variable, pay per use |
| Aurora Provisioned | Instance resize | db.t3 to db.r6g | Minutes (restart) | Fixed per instance |
| RDS | Instance resize | Full instance range | Minutes (restart) | Fixed per instance |
| DynamoDB On-Demand | Automatic | Unlimited | Instant | Pay per request |
| DynamoDB Provisioned | Manual/Auto | Defined RCU/WCU | Minutes | Fixed capacity |
| ElastiCache | Shard scaling | Node count | Minutes | Per node |
| Cosmos DB Serverless | Automatic | Pay per RU | Instant | Pay per request |
DynamoDB capacity modes:
DynamoDB offers two capacity modes with very different cost characteristics:
Provisioned capacity:
- You specify read/write capacity units (RCU/WCU) in advance
- Cheapest per request for steady, predictable traffic
- Requests exceeding provisioned capacity are throttled unless auto-scaling reacts in time
On-demand capacity:
- No capacity planning; DynamoDB absorbs traffic changes instantly
- Pay per request at a higher unit price
- Cost-effective for spiky, unpredictable, or low-volume workloads
Cost comparison example:
Workload: Average 100 reads/sec, 20 writes/sec
Provisioned: 100 RCU × $0.00013 + 20 WCU × $0.00065 = $0.026/hour = $19/month
On-Demand: (100×3600×730 reads × $0.25/1M) + (20×3600×730 writes × $1.25/1M)
= $65.70 + $65.70 = $131/month
For steady workload: Provisioned is ~7x cheaper
For variable workload: On-Demand may be cheaper if average is much lower than peak
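The comparison above can be reproduced with a short script using the same example prices:

```python
# Prices from the example above (us-east-1-style rates)
RCU_HOUR, WCU_HOUR = 0.00013, 0.00065  # provisioned $/capacity-unit-hour
READ_PER_M, WRITE_PER_M = 0.25, 1.25   # on-demand $/million requests
HOURS = 730


def provisioned_monthly(rcu: float, wcu: float) -> float:
    """Monthly cost of steadily provisioned read/write capacity."""
    return (rcu * RCU_HOUR + wcu * WCU_HOUR) * HOURS


def on_demand_monthly(reads_per_sec: float, writes_per_sec: float) -> float:
    """Monthly on-demand cost for a sustained request rate."""
    monthly = lambda rps: rps * 3600 * HOURS  # requests per month
    return monthly(reads_per_sec) * READ_PER_M / 1e6 + monthly(writes_per_sec) * WRITE_PER_M / 1e6


print(f"provisioned: ${provisioned_monthly(100, 20):.0f}/month")  # ~$19
print(f"on-demand:   ${on_demand_monthly(100, 20):.0f}/month")    # ~$131
```

The ~7x gap only holds for sustained traffic; on-demand wins when the table sits idle most of the time.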
Many tables benefit from starting on On-Demand (no capacity planning risk) then switching to Provisioned with auto-scaling once patterns are understood. You can switch modes once per 24 hours. Use On-Demand for development/new features; switch to Provisioned for production with known patterns.
To optimize auto-scaling for cost, you need to measure its effectiveness. Key metrics help you understand whether your scaling policies are working.
Key scaling efficiency metrics:
- Average utilization: how much of provisioned capacity is actually used
- Waste percentage: unused capacity as a share of total capacity
- Scale-in ratio: scale-in events as a share of all scaling events
- Scaling savings: actual spend vs. hypothetical fixed capacity at peak
-- Scaling Efficiency Analysis Queries
-- For use with CloudWatch Metrics exported to a data warehouse

-- 1. Average utilization by hour of day
-- Identifies patterns for scheduled scaling
SELECT
    EXTRACT(HOUR FROM timestamp) AS hour_of_day,
    EXTRACT(DOW FROM timestamp) AS day_of_week,
    AVG(cpu_utilization) AS avg_cpu,
    AVG(instance_count) AS avg_instances,
    MIN(instance_count) AS min_instances,
    MAX(instance_count) AS max_instances
FROM scaling_metrics
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 2, 1;

-- 2. Capacity waste analysis
-- Calculates unused capacity (headroom)
SELECT
    DATE(timestamp) AS date,
    SUM(instance_count * (100 - cpu_utilization) / 100) AS wasted_capacity_units,
    SUM(instance_count) AS total_capacity_units,
    ROUND(
        SUM(instance_count * (100 - cpu_utilization) / 100)
        / SUM(instance_count) * 100, 2
    ) AS waste_percentage
FROM scaling_metrics
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;

-- 3. Scaling event analysis
SELECT
    DATE(timestamp) AS date,
    SUM(CASE WHEN event_type = 'scale_out' THEN 1 ELSE 0 END) AS scale_outs,
    SUM(CASE WHEN event_type = 'scale_in' THEN 1 ELSE 0 END) AS scale_ins,
    ROUND(
        SUM(CASE WHEN event_type = 'scale_in' THEN 1 ELSE 0 END)::DECIMAL
        / NULLIF(COUNT(*), 0) * 100, 2
    ) AS scale_in_ratio_pct
FROM scaling_events
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;

-- 4. Cost savings from scaling
-- Compare actual spend to hypothetical fixed capacity
WITH actual_cost AS (
    SELECT
        DATE(timestamp) AS date,
        SUM(instance_count * p.hourly_rate / 60) AS actual_cost  -- per-minute samples
    FROM scaling_metrics m
    JOIN instance_pricing p ON m.instance_type = p.instance_type
    GROUP BY 1
),
fixed_cost AS (
    SELECT
        DATE(timestamp) AS date,
        MAX(instance_count) * 24 * p.hourly_rate AS fixed_cost
    FROM scaling_metrics m
    JOIN instance_pricing p ON m.instance_type = p.instance_type
    GROUP BY 1, p.hourly_rate
)
SELECT
    a.date,
    a.actual_cost,
    f.fixed_cost,
    f.fixed_cost - a.actual_cost AS savings,
    ROUND((f.fixed_cost - a.actual_cost) / f.fixed_cost * 100, 2) AS savings_pct
FROM actual_cost a
JOIN fixed_cost f ON a.date = f.date
ORDER BY 1;

Building a scaling cost dashboard:
Create a dashboard that visualizes:
- Instance count over time, overlaid with demand (requests or CPU)
- Utilization by hour of day and day of week
- Scale-out vs. scale-in events per day
- Actual spend vs. hypothetical fixed-capacity spend
Target benchmarks:
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Avg Utilization | <40% | 40-60% | 60-75% | 75-85% |
| Waste % | >50% | 30-50% | 20-30% | <20% |
| Scale-in Ratio | <30% | 30-45% | 45-55% | ~50% |
| Scaling Savings | <20% | 20-35% | 35-50% | >50% |
Auto-scaling transforms cloud infrastructure from a fixed cost into a variable cost that tracks demand. When implemented with cost in mind, it can reduce compute spending by 40-60% while maintaining or improving availability.
What's next:
With capacity optimized through right-sizing and auto-scaling, you need visibility to maintain and improve these gains over time. The final page in this module explores Cost Monitoring Tools—the dashboards, alerts, and analytics that provide ongoing visibility into cloud spending and optimization opportunities.
You now understand how to use auto-scaling as a cost optimization strategy, not just an availability tool. The combination of right-sizing (what you provision), purchasing (how you pay), and auto-scaling (when you provision) forms a comprehensive compute cost optimization framework. Next, we'll complete the picture with cost monitoring and governance.