A B2B SaaS company serving enterprise customers noticed a peculiar pattern: their infrastructure ran at near-capacity during business hours (9 AM - 6 PM EST) but dropped to 15% utilization during nights and weekends. Yet they were paying full price 24/7—168 hours per week for a workload that only needed full capacity for 45 hours.
After implementing aggressive auto-scaling policies, they reduced their instance count from a fixed 100 to a dynamic 20-100 range. The result: a 45% reduction in compute costs without any change to user experience. Their infrastructure now breathes with their business.
Auto-scaling is typically framed as an availability tool—scaling up to handle load spikes. But it's equally powerful as a cost optimization strategy—scaling down when demand decreases. The cloud's promise is "pay for what you use," but that only works if you're not running idle capacity.
This page reframes auto-scaling through a cost lens: how to design scaling policies that minimize waste, implement scale-to-zero patterns, and build systems that automatically match capacity to demand.
By the end of this page, you will understand how to design auto-scaling for cost efficiency: configuring scale-down policies, implementing scheduled scaling, building scale-to-zero architectures, and measuring the cost impact of your scaling strategies.
To understand why auto-scaling is a cost strategy, we need to understand the difference between peak capacity and average demand. Most workloads exhibit significant variability:
Temporal patterns:
- Daily: traffic concentrated in business hours, quiet overnight
- Weekly: busy weekdays, light weekends
- Seasonal: end-of-month, end-of-quarter, or holiday peaks
The fixed-capacity problem:
Without auto-scaling, you must provision for peak capacity:
| Time Period | Demand | Fixed Capacity | Utilization | Waste |
|---|---|---|---|---|
| Peak (4 hrs/day) | 100 units | 100 units | 100% | 0% |
| Normal (8 hrs/day) | 50 units | 100 units | 50% | 50% |
| Off-peak (12 hrs/day) | 20 units | 100 units | 20% | 80% |
| Weighted Average | ~43 units | 100 units | 43% | 57% |
You're paying for 100 units of capacity but using only about 43 on average—57% waste.
The dynamic-capacity opportunity:
With effective auto-scaling:
| Time Period | Demand | Dynamic Capacity | Utilization | Waste |
|---|---|---|---|---|
| Peak (4 hrs/day) | 100 units | 120 units (+20% buffer) | 83% | 17% |
| Normal (8 hrs/day) | 50 units | 60 units | 83% | 17% |
| Off-peak (12 hrs/day) | 20 units | 25 units | 80% | 20% |
| Weighted Average | ~43 units | ~53 units | ~82% | ~18% |
You're now paying for roughly 53 units on average instead of 100—close to a 50% cost reduction while maintaining comfortable headroom. (The worked example below rounds the scaled average up to 55 instances.)
The cost equation:
Auto-scaling savings = (Fixed capacity - Average scaled capacity) × Cost per unit × Hours
Example:
Fixed: 100 instances × $0.10/hr × 730 hrs/month = $7,300/month
Scaled: 55 instances avg × $0.10/hr × 730 hrs/month = $4,015/month
Savings: $3,285/month (45%)
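The arithmetic above can be checked with a short script (rates and hours taken directly from the example):

```python
HOURS_PER_MONTH = 730
RATE = 0.10  # $/instance-hour, from the example


def monthly_cost(instances: float, rate: float = RATE, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly compute cost for a given (average) instance count."""
    return instances * rate * hours


fixed = monthly_cost(100)   # fixed fleet provisioned for peak
scaled = monthly_cost(55)   # average fleet size under auto-scaling
savings = fixed - scaled
print(f"Fixed: ${fixed:,.0f}  Scaled: ${scaled:,.0f}  Savings: ${savings:,.0f} ({savings / fixed:.0%})")
# Fixed: $7,300  Scaled: $4,015  Savings: $3,285 (45%)
```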
Traditional scaling targets 70-80% utilization to maintain headroom for spikes. For cost optimization, consider separate targets: scale out at 70% (protect availability), scale in at 40% (reclaim waste). Asymmetric thresholds prevent oscillation while maintaining cost efficiency.
Auto-scaling policies define when and how much to scale. While scale-out policies get the most attention (protecting against overload), scale-in policies drive cost savings.
Types of scaling policies:
1. Target Tracking Scaling
Maintains a target value for a metric (e.g., 70% CPU). The simplest approach, but can be suboptimal for cost:
# AWS Auto Scaling Policy (target tracking)
TargetTrackingScalingPolicyConfiguration:
  TargetValue: 70.0                                # Target 70% CPU
  PredefinedMetricSpecification:
    PredefinedMetricType: ASGAverageCPUUtilization
  ScaleOutCooldown: 60                             # Seconds before next scale-out
  ScaleInCooldown: 300                             # Seconds before next scale-in
Cost optimization tip: Use a longer ScaleInCooldown to prevent thrashing, but don't make it so long that you maintain excess capacity for extended periods.
2. Step Scaling
Defines different scaling adjustments based on metric thresholds. Allows aggressive scale-out but conservative scale-in:
# Scale-out policy (alarm threshold: 70% CPU)
StepAdjustments:
  # Scale out aggressively
  - MetricIntervalLowerBound: 0      # CPU 70-85%
    MetricIntervalUpperBound: 15
    ScalingAdjustment: 1             # Add 1 instance
  - MetricIntervalLowerBound: 15     # CPU 85-100%
    ScalingAdjustment: 3             # Add 3 instances (faster)

# Scale-in policy (separate low-CPU alarm at 50%)
StepAdjustments:
  # Scale in conservatively
  - MetricIntervalLowerBound: -20    # CPU 30-50%
    MetricIntervalUpperBound: 0
    ScalingAdjustment: -1            # Remove 1 instance
  - MetricIntervalUpperBound: -20    # CPU < 30%
    ScalingAdjustment: -2            # Remove 2 instances
3. Predictive Scaling
Uses machine learning to forecast demand and pre-scale capacity. Available in AWS and increasingly in other platforms:
# AWS Predictive Scaling
PredictiveScalingConfiguration:
  MetricSpecifications:
    - TargetValue: 70.0
      PredefinedMetricPairSpecification:
        PredefinedMetricType: ASGCPUUtilization
  Mode: ForecastAndScale                           # Both predict and act
  SchedulingBufferTime: 300                        # Scale 5 min before predicted spike
  MaxCapacityBreachBehavior: IncreaseMaxCapacity
Predictive scaling excels when:
- Demand follows regular, recurring patterns (daily or weekly cycles)
- Instances take minutes to warm up, so reactive scaling lags behind ramp-ups
- Load ramps are steep enough that reacting after the fact degrades service
Cost benefit: Predictive scaling often maintains lower average capacity than reactive scaling because it doesn't over-provision in response to spikes.
Without proper tuning, auto-scaling can oscillate—scaling up, then immediately scaling down, then up again. This wastes money (new instances cost time and initial resources) and creates instability. Use cooldown periods, asymmetric thresholds, and careful metric selection to prevent oscillation.
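The oscillation problem can be illustrated with a toy model (each instance serves one unit of load at 100% CPU; thresholds and load values are hypothetical):

```python
def next_capacity(load: float, capacity: int, out_at: float, in_at: float) -> int:
    """One evaluation of a simple scaling loop: each instance serves 1 unit at 100% CPU."""
    util = load / capacity * 100
    if util >= out_at:
        return capacity + 1
    if util <= in_at:
        return max(1, capacity - 1)
    return capacity


def simulate(load: float, capacity: int, out_at: float, in_at: float, steps: int = 20) -> list:
    """Run the scaling loop for a fixed number of evaluations and record fleet size."""
    trace = [capacity]
    for _ in range(steps):
        capacity = next_capacity(load, capacity, out_at, in_at)
        trace.append(capacity)
    return trace


# Narrow dead band (scale in at 65%): the fleet thrashes between 10 and 11 forever
print(simulate(7.0, 10, out_at=70, in_at=65)[:6])  # [10, 11, 10, 11, 10, 11]
# Asymmetric thresholds (scale in at 40%): the fleet settles at 11 and stays there
print(simulate(7.0, 10, out_at=70, in_at=40)[:6])  # [10, 11, 11, 11, 11, 11]
```

With the wide dead band, a steady load of 7 units lands at ~64% utilization on 11 instances, inside the band, so no further scaling events fire.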
When you know when demand will change, scheduled scaling is more efficient than reactive scaling. It provides proactive capacity management with guaranteed scale-down.
When to use scheduled scaling:
- Demand follows a known calendar (business hours, nightly batch windows, weekends)
- Development, staging, and QA environments used only during working hours
- Recurring events with predictable timing (daily reports, scheduled imports)
# Scheduled Scaling for Business Hours Application
# Scale up for business hours, down for nights/weekends

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Business hours start: 7 AM ET (12:00 UTC)
  recurrence       = "0 12 * * 1-5"  # Mon-Fri at 7 AM ET
  min_size         = 10
  max_size         = 50
  desired_capacity = 20
}

resource "aws_autoscaling_schedule" "scale_down_evening" {
  scheduled_action_name  = "scale-down-evening"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Business hours end: 8 PM ET (01:00 UTC next day)
  recurrence       = "0 1 * * 2-6"  # Tue-Sat in UTC = Mon-Fri at 8 PM ET
  min_size         = 2
  max_size         = 20
  desired_capacity = 5
}

resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Weekend: minimal capacity
  recurrence       = "0 1 * * 0,6"  # Saturday and Sunday nights
  min_size         = 1
  max_size         = 10
  desired_capacity = 2
}

# Development environment: scale to zero at night
resource "aws_autoscaling_schedule" "dev_off" {
  count                  = var.environment == "development" ? 1 : 0
  scheduled_action_name  = "dev-scale-to-zero"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence       = "0 2 * * *"  # Every night at 9 PM ET
  min_size         = 0
  max_size         = 0
  desired_capacity = 0
}

resource "aws_autoscaling_schedule" "dev_on" {
  count                  = var.environment == "development" ? 1 : 0
  scheduled_action_name  = "dev-scale-up"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence       = "0 13 * * 1-5"  # Mon-Fri at 8 AM ET
  min_size         = 1
  max_size         = 5
  desired_capacity = 2
}

Combining scheduled and dynamic scaling:
The most effective approach combines scheduled scaling (predictable patterns) with dynamic scaling (unexpected fluctuations):
Scheduled scaling: Sets baseline capacity for time of day
├── 7 AM: min=10, max=50, desired=20
├── 8 PM: min=2, max=20, desired=5
└── Weekend: min=1, max=10, desired=2
Dynamic scaling: Adjusts within scheduled bounds
├── If CPU > 70%: scale out (up to max)
└── If CPU < 40%: scale in (down to min)
Scheduled scaling sets the permitted range; dynamic scaling adjusts within that range based on actual demand.
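The interaction reduces to a clamp: dynamic scaling proposes a capacity, and the scheduled window bounds it. A minimal sketch (window values taken from the diagram above):

```python
def desired_within_schedule(dynamic_desired: int, sched_min: int, sched_max: int) -> int:
    """Dynamic scaling proposes a capacity; the active scheduled window clamps it."""
    return max(sched_min, min(sched_max, dynamic_desired))


# Business-hours window (min=10, max=50): a quiet spell can't drop below 10
print(desired_within_schedule(4, 10, 50))   # 10
# Overnight window (min=2, max=20): an unexpected spike is capped at 20
print(desired_within_schedule(35, 2, 20))   # 20
# Within bounds, the dynamic value wins unchanged
print(desired_within_schedule(15, 10, 50))  # 15
```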
Savings calculation:
Scenario: 50-instance fixed capacity
With scheduled scaling:
Business hours (45 hrs/week): avg 35 instances
Off-hours (123 hrs/week): avg 10 instances
Weighted average: (45×35 + 123×10) / 168 ≈ 16.7 instances avg
Savings: (50 - 16.7) / 50 ≈ 67% reduction
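The weighted-average arithmetic can be verified in a few lines:

```python
def weighted_avg_instances(segments: list) -> float:
    """segments: (hours_per_week, avg_instances) pairs covering all 168 hours."""
    total_hours = sum(h for h, _ in segments)
    return sum(h * n for h, n in segments) / total_hours


# Business hours: 45 hrs/week at 35 instances; off-hours: 123 hrs/week at 10
avg = weighted_avg_instances([(45, 35), (123, 10)])
print(f"avg ≈ {avg:.1f} instances, savings ≈ {(50 - avg) / 50:.0%}")
# avg ≈ 16.7 instances, savings ≈ 67%
```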
Development, staging, and QA environments often run 24/7 despite being used only during business hours. Implement aggressive scheduled scaling or complete shutdown for non-production. A development environment that scales to zero from 7 PM to 8 AM saves 54% of compute costs with zero impact.
The ultimate cost optimization is scale-to-zero: running zero capacity when there's zero demand. This is the promise of serverless, but can be achieved with traditional compute as well.
Native scale-to-zero services:
| Service | Scale-to-Zero | Cold Start | Cost When Idle |
|---|---|---|---|
| Lambda | Native | 100ms - 10s | $0 |
| Fargate (ECS/EKS) | Native (with config) | 30s - 2min | $0 |
| Aurora Serverless v2 | To 0.5 ACU minimum | N/A (always warm) | ~$40/month minimum |
| DynamoDB On-Demand | Native | None | $0 (storage only) |
| API Gateway | Native | None | $0 (no requests) |
| App Runner | Native | Seconds | ~$5/month minimum |
| EKS with Karpenter | To 0 nodes | 30s - 2min | $0 for nodes |
Implementing scale-to-zero with containers:
Traditional container orchestration (ECS, Kubernetes) doesn't scale to zero by default because there's no mechanism to wake up instances when requests arrive. However, several patterns enable this:
# KEDA ScaledObject: Scale to Zero Based on SQS Queue Depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor       # Deployment to scale
  pollingInterval: 15           # Check queue every 15s
  cooldownPeriod: 300           # Wait 5 min before scaling to zero
  minReplicaCount: 0            # Enable scale to zero!
  maxReplicaCount: 50           # Maximum scale
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "10"       # Target: 10 messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: aws-credentials
---
# Alternative: HTTP-based scale to zero with Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_total
        threshold: "100"        # 100 requests per replica
        query: |
          sum(rate(http_requests_total{service="api"}[2m]))
---
# Knative Service with scale-to-zero
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: order-api
  namespace: production
spec:
  template:
    metadata:
      annotations:
        # Scale to zero after 5 minutes of inactivity
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "5m"
        # Target 70% utilization
        autoscaling.knative.dev/target-utilization-percentage: "70"
        # Max scale
        autoscaling.knative.dev/max-scale: "50"
    spec:
      containers:
        - image: myregistry/order-api:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"

Cold start considerations:
Scale-to-zero introduces cold starts: the delay when scaling from zero to handle the first request. Managing cold starts is essential for user-facing applications:
| Strategy | Description | Use Case |
|---|---|---|
| Accept latency | First request waits for scale-up | Background processing, async APIs |
| Minimum of 1 | Keep at least 1 instance warm | Low-traffic APIs needing responsiveness |
| Provisioned concurrency | Pre-warm instances (Lambda) | Latency-sensitive Lambda functions |
| Request queuing | Queue requests during scale-up | Knative, KEDA with queue triggers |
| Hybrid | Scale to 1 during business hours, 0 overnight | Business-hours applications |
Scale-to-zero saves money but increases first-request latency. For user-facing applications, this trade-off may be unacceptable. Consider: keeping minimum=1 during business hours (still saves nights/weekends), accepting cold starts for internal tools, or using scale-to-zero only for async/batch workloads.
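The hybrid strategy reduces to a schedule-aware floor on replicas. A minimal sketch (business-hours window and weekday rules are hypothetical, chosen for illustration):

```python
from datetime import datetime


def min_replicas(now: datetime, business_start: int = 8, business_end: int = 19) -> int:
    """Hybrid cold-start policy: keep one warm replica during weekday business
    hours, allow scale-to-zero overnight and on weekends."""
    if now.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return 0
    return 1 if business_start <= now.hour < business_end else 0


print(min_replicas(datetime(2024, 6, 3, 10)))  # Monday 10 AM -> 1 (stay warm)
print(min_replicas(datetime(2024, 6, 3, 23)))  # Monday 11 PM -> 0 (scale to zero)
print(min_replicas(datetime(2024, 6, 8, 12)))  # Saturday noon -> 0 (scale to zero)
```

A scheduled job could feed this value into the autoscaler's minimum (e.g., KEDA's minReplicaCount) so the first business-hours request never hits a cold start.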
Kubernetes presents unique scaling challenges and opportunities. The two-level abstraction (pods and nodes) requires coordinated scaling for cost efficiency.
Pod scaling (HPA - Horizontal Pod Autoscaler):
HPA scales pods based on resource utilization or custom metrics. For cost efficiency, configure scale-up to react quickly and scale-down to reclaim idle capacity steadily:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # Target 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80      # Target 80% memory
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # React quickly to load
      policies:
        - type: Percent
          value: 100                  # Can double capacity
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scale-down
      policies:
        - type: Percent
          value: 25                   # Remove at most 25% at a time
          periodSeconds: 60
Node scaling with Karpenter:
Karpenter provides more cost-efficient node scaling than Cluster Autoscaler:
- Provisions right-sized nodes directly from pending pod requirements, rather than scaling fixed node groups
- Bin-packs workloads and actively consolidates underutilized nodes
- Mixes spot and on-demand capacity across diverse instance types natively
Karpenter consolidation for cost:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m6i.large
            - m6i.xlarge
            - m6i.2xlarge
            - m6a.large
            - m6a.xlarge
            - c6i.xlarge
            - c6i.2xlarge
  disruption:
    # Key for cost optimization:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 1m
    # Alternative: WhenEmpty only removes empty nodes
    # consolidationPolicy: WhenEmpty
  # Cost limits
  limits:
    cpu: 500
    memory: 1000Gi
Cluster Autoscaler scales node groups (ASGs) reactively based on pending pods. It's simpler but less cost-efficient. Karpenter provisions individual nodes with optimal sizing and actively consolidates. For cost optimization, Karpenter is significantly better—often 30-50% more efficient than Cluster Autoscaler.
Databases and stateful services present the greatest challenge for dynamic scaling. They can't simply be terminated and restarted like stateless compute. However, modern cloud databases offer scaling options:
Aurora Serverless v2:
AWS Aurora Serverless automatically scales database capacity based on load:
Capacity range: 0.5 ACU to 128 ACU
(1 ACU ≈ 2 GB RAM)
Scaling:
- Scales in seconds, not minutes
- Scales based on CPU, connections, memory
- Scales down when load decreases
- No downtime during scaling
Cost model:
- Pay per ACU-hour consumed
- Minimum cost: ~$43/month (0.5 ACU × 730 hours)
- vs. db.t3.medium: ~$50/month fixed
For variable workloads, Aurora Serverless can reduce costs by 40-60% compared to provisioned Aurora.
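A rough break-even sketch makes the comparison concrete. It assumes ~$0.12 per ACU-hour (consistent with the ~$43/month floor above) and, for simplicity, prices provisioned capacity at the same per-ACU rate; real provisioned pricing differs by instance class:

```python
ACU_RATE = 0.12  # assumed $/ACU-hour; the ~$43/month floor above implies ≈ $0.12
HOURS = 730


def serverless_cost(avg_acu: float, hours: int = HOURS) -> float:
    """Serverless v2 bills for consumed ACU-hours, with a 0.5 ACU floor."""
    return max(avg_acu, 0.5) * ACU_RATE * hours


def provisioned_cost(peak_acu: float, hours: int = HOURS) -> float:
    """Provisioned capacity must be sized for peak and billed whether used or not."""
    return peak_acu * ACU_RATE * hours


# Variable workload: peaks at 8 ACU during business hours, averages 3 ACU
print(f"provisioned: ${provisioned_cost(8):.0f}  serverless: ${serverless_cost(3):.0f}")
# provisioned: $701  serverless: $263
```

Under these assumptions the serverless bill is about 62% lower, in line with the 40-60% range quoted above; the gap shrinks as average load approaches peak.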
When Aurora Serverless v2 excels:
- Variable or spiky workloads with long idle periods
- Development and test databases used intermittently
- New applications with unknown load patterns
- Multi-tenant databases with uneven tenant activity
| Service | Scaling Type | Scale Range | Scaling Speed | Cost Impact |
|---|---|---|---|---|
| Aurora Serverless v2 | Automatic capacity | 0.5-128 ACU | Seconds | Variable, pay per use |
| Aurora Provisioned | Instance resize | db.t3 to db.r6g | Minutes (restart) | Fixed per instance |
| RDS | Instance resize | Full instance range | Minutes (restart) | Fixed per instance |
| DynamoDB On-Demand | Automatic | Unlimited | Instant | Pay per request |
| DynamoDB Provisioned | Manual/Auto | Defined RCU/WCU | Minutes | Fixed capacity |
| ElastiCache | Shard scaling | Node count | Minutes | Per node |
| Cosmos DB Serverless | Automatic | Pay per RU | Instant | Pay per request |
DynamoDB capacity modes:
DynamoDB offers two capacity modes with very different cost characteristics:
Provisioned capacity:
- You specify read/write capacity units (RCU/WCU) in advance
- Cheapest per request for steady, predictable traffic
- Requests exceeding provisioned capacity are throttled unless auto-scaling reacts in time
On-demand capacity:
- No capacity planning; DynamoDB absorbs traffic changes instantly
- Pay per request at a higher unit price
- Cost-effective for spiky, unpredictable, or low-volume workloads
Cost comparison example:
Workload: Average 100 reads/sec, 20 writes/sec
Provisioned: 100 RCU × $0.00013 + 20 WCU × $0.00065 = $0.026/hour = $19/month
On-Demand: (100×3600×730 reads × $0.25/1M) + (20×3600×730 writes × $1.25/1M)
= $65.70 + $65.70 = $131/month
For steady workload: Provisioned is ~7x cheaper
For variable workload: On-Demand may be cheaper if average is much lower than peak
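The comparison above can be reproduced with a short script using the same example prices:

```python
# Prices from the example above (us-east-1-style rates)
RCU_HOUR, WCU_HOUR = 0.00013, 0.00065  # provisioned $/capacity-unit-hour
READ_PER_M, WRITE_PER_M = 0.25, 1.25   # on-demand $/million requests
HOURS = 730


def provisioned_monthly(rcu: float, wcu: float) -> float:
    """Monthly cost of steadily provisioned read/write capacity."""
    return (rcu * RCU_HOUR + wcu * WCU_HOUR) * HOURS


def on_demand_monthly(reads_per_sec: float, writes_per_sec: float) -> float:
    """Monthly on-demand cost for a sustained request rate."""
    monthly = lambda rps: rps * 3600 * HOURS  # requests per month
    return monthly(reads_per_sec) * READ_PER_M / 1e6 + monthly(writes_per_sec) * WRITE_PER_M / 1e6


print(f"provisioned: ${provisioned_monthly(100, 20):.0f}/month")  # ~$19
print(f"on-demand:   ${on_demand_monthly(100, 20):.0f}/month")    # ~$131
```

The ~7x gap only holds for sustained traffic; on-demand wins when the table sits idle most of the time.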
Many tables benefit from starting on On-Demand (no capacity planning risk) then switching to Provisioned with auto-scaling once patterns are understood. You can switch modes once per 24 hours. Use On-Demand for development/new features; switch to Provisioned for production with known patterns.
To optimize auto-scaling for cost, you need to measure its effectiveness. Key metrics help you understand whether your scaling policies are working.
Key scaling efficiency metrics:
- Average utilization: how much of provisioned capacity is actually used
- Waste percentage: unused capacity as a share of total capacity
- Scale-in ratio: scale-in events as a share of all scaling events
- Scaling savings: actual spend vs. hypothetical fixed capacity at peak
-- Scaling Efficiency Analysis Queries
-- For use with CloudWatch Metrics exported to a data warehouse

-- 1. Average utilization by hour of day
-- Identifies patterns for scheduled scaling
SELECT
    EXTRACT(HOUR FROM timestamp) AS hour_of_day,
    EXTRACT(DOW FROM timestamp) AS day_of_week,
    AVG(cpu_utilization) AS avg_cpu,
    AVG(instance_count) AS avg_instances,
    MIN(instance_count) AS min_instances,
    MAX(instance_count) AS max_instances
FROM scaling_metrics
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 2, 1;

-- 2. Capacity waste analysis
-- Calculates unused capacity (headroom)
SELECT
    DATE(timestamp) AS date,
    SUM(instance_count * (100 - cpu_utilization) / 100) AS wasted_capacity_units,
    SUM(instance_count) AS total_capacity_units,
    ROUND(
        SUM(instance_count * (100 - cpu_utilization) / 100)
        / SUM(instance_count) * 100, 2
    ) AS waste_percentage
FROM scaling_metrics
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;

-- 3. Scaling event analysis
SELECT
    DATE(timestamp) AS date,
    SUM(CASE WHEN event_type = 'scale_out' THEN 1 ELSE 0 END) AS scale_outs,
    SUM(CASE WHEN event_type = 'scale_in' THEN 1 ELSE 0 END) AS scale_ins,
    ROUND(
        SUM(CASE WHEN event_type = 'scale_in' THEN 1 ELSE 0 END)::DECIMAL
        / NULLIF(COUNT(*), 0) * 100, 2
    ) AS scale_in_ratio_pct
FROM scaling_events
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;

-- 4. Cost savings from scaling
-- Compare actual spend to hypothetical fixed capacity
WITH actual_cost AS (
    SELECT
        DATE(timestamp) AS date,
        SUM(instance_count * p.hourly_rate / 60) AS actual_cost  -- per-minute samples
    FROM scaling_metrics m
    JOIN instance_pricing p ON m.instance_type = p.instance_type
    GROUP BY 1
),
fixed_cost AS (
    SELECT
        DATE(timestamp) AS date,
        MAX(instance_count) * 24 * p.hourly_rate AS fixed_cost
    FROM scaling_metrics m
    JOIN instance_pricing p ON m.instance_type = p.instance_type
    GROUP BY 1, p.hourly_rate
)
SELECT
    a.date,
    a.actual_cost,
    f.fixed_cost,
    f.fixed_cost - a.actual_cost AS savings,
    ROUND((f.fixed_cost - a.actual_cost) / f.fixed_cost * 100, 2) AS savings_pct
FROM actual_cost a
JOIN fixed_cost f ON a.date = f.date
ORDER BY 1;

Building a scaling cost dashboard:
Create a dashboard that visualizes:
- Instance count over time, overlaid with demand (requests or CPU)
- Utilization by hour of day and day of week
- Scale-out vs. scale-in events per day
- Actual spend vs. hypothetical fixed-capacity spend
Target benchmarks:
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Avg Utilization | <40% | 40-60% | 60-75% | 75-85% |
| Waste % | >50% | 30-50% | 20-30% | <20% |
| Scale-in Ratio | <30% | 30-45% | 45-55% | ~50% |
| Scaling Savings | <20% | 20-35% | 35-50% | >50% |
Auto-scaling transforms cloud infrastructure from a fixed cost into a variable cost that tracks demand. When implemented with cost in mind, it can reduce compute spending by 40-60% while maintaining or improving availability.
What's next:
With capacity optimized through right-sizing and auto-scaling, you need visibility to maintain and improve these gains over time. The final page in this module explores Cost Monitoring Tools—the dashboards, alerts, and analytics that provide ongoing visibility into cloud spending and optimization opportunities.
You now understand how to use auto-scaling as a cost optimization strategy, not just an availability tool. The combination of right-sizing (what you provision), purchasing (how you pay), and auto-scaling (when you provision) forms a comprehensive compute cost optimization framework. Next, we'll complete the picture with cost monitoring and governance.