One of Kubernetes' most powerful capabilities is auto-scaling—the ability to automatically adjust compute resources in response to changing demand. Done right, auto-scaling delivers the dream of elastic infrastructure: your application seamlessly handles traffic spikes without manual intervention, then scales down during quiet periods to minimize costs.
But auto-scaling is not magic. Poorly configured autoscalers can cause oscillation (rapid scale up/down cycles), fail to respond quickly enough to traffic spikes, or scale too aggressively and waste resources. Mastering auto-scaling requires understanding the different scaling dimensions, the algorithms that drive them, and the intricate interplay between resource configuration and scaling behavior.
Kubernetes provides three complementary scaling mechanisms:

- Horizontal Pod Autoscaler (HPA): adjusts the number of pod replicas
- Vertical Pod Autoscaler (VPA): adjusts the CPU/memory allocated to each pod
- Cluster Autoscaler (CA): adjusts the number of nodes in the cluster
This page provides a deep, production-focused exploration of each mechanism.
By the end of this page, you'll understand how to configure HPA with built-in and custom metrics, implement VPA for automatic resource right-sizing, coordinate between HPA and VPA, integrate with Cluster Autoscaler, and design scaling strategies that balance responsiveness with stability.
Before diving into specific autoscalers, it's essential to understand the different dimensions along which Kubernetes can scale.
Horizontal Scaling (Scale Out/In)
Adds or removes pod replicas while keeping each pod's resource allocation constant. This is the most common scaling approach for stateless services.
Advantages:

- No restarts: existing pods keep serving traffic while replicas are added or removed
- Near-linear capacity gains for stateless services
- More replicas also improve fault tolerance

Limitations:

- Requires the workload to tolerate multiple replicas (stateless, or state kept externally)
- Cannot help a single pod that is under-resourced for its baseline work
- Every replica carries fixed overhead (sidecars, connection pools, caches)
Vertical Scaling (Scale Up/Down)
Increases or decreases the CPU/memory allocated to each pod. Requires pod restart to apply (in most cases).
Advantages:

- Works for workloads that cannot be replicated (singletons, many stateful apps)
- Corrects under- or over-provisioned pods without architectural changes

Limitations:

- Applying new values usually requires a pod restart
- A pod can never grow beyond the capacity of the node it runs on
- Reacts more slowly than adding replicas
Cluster Scaling (Node Addition/Removal)
Adds or removes nodes from the cluster. Necessary when existing nodes are fully utilized.
| Dimension | What Changes | Speed | Disruption | Limit |
|---|---|---|---|---|
| Horizontal (HPA) | Number of pods | Fast (seconds) | None to existing pods | Cluster capacity |
| Vertical (VPA) | Pod CPU/memory | Slow (minutes) | Pod restart required | Node capacity |
| Cluster (CA) | Number of nodes | Slow (minutes) | Pod eviction on node removal | Cloud/budget limit |
The most robust scaling strategy combines all three dimensions: HPA responds quickly to load changes, VPA right-sizes pods during off-peak hours, and Cluster Autoscaler ensures nodes exist to accommodate the workload. Getting these to work together harmoniously is the key challenge.
The Horizontal Pod Autoscaler (HPA) is Kubernetes' most commonly used autoscaler. It automatically scales the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.
How HPA Works:
Metric Collection: HPA controller queries the Metrics Server (or custom metrics adapter) at regular intervals (default: 15 seconds)
Desired Replica Calculation: Using the target metric value, HPA calculates the desired replica count:
desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))
Stabilization: HPA applies stabilization windows to prevent oscillation (scale-up and scale-down have different windows)
Scaling Action: If the desired replica count differs from the current one, HPA patches the target's spec.replicas
HPA API Versions:
- autoscaling/v1: Basic CPU-only scaling
- autoscaling/v2: Full-featured with memory, custom metrics, and behaviors (use this)
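For quick experiments, kubectl can also create a basic CPU-based HPA imperatively before you commit to a full v2 manifest (the deployment name and thresholds here are illustrative):

```bash
# Create a simple CPU-utilization HPA without writing YAML
kubectl autoscale deployment web-service --cpu-percent=70 --min=3 --max=50

# Watch scaling decisions as they happen
kubectl get hpa web-service -w
```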
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
  namespace: production
spec:
  # Target workload to scale
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service

  # Replica bounds
  minReplicas: 3    # Never go below 3 replicas
  maxReplicas: 50   # Never exceed 50 replicas

  # Metrics to scale on (multiple metrics combined)
  metrics:
  # Metric 1: CPU utilization (most common)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Target 70% CPU utilization

  # Metric 2: Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Target 80% memory utilization

  # Metric 3: Custom metric (requests per second per pod)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"  # Target 1000 RPS per pod

  # Metric 4: External metric (queue depth)
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: "100"  # Scale when queue > 100 messages

  # Scaling behavior (v2 feature)
  behavior:
    # Scale-up behavior: Aggressive for responsiveness
    scaleUp:
      stabilizationWindowSeconds: 0  # Immediate scale-up
      policies:
      - type: Percent
        value: 100      # Allow doubling replicas
        periodSeconds: 15
      - type: Pods
        value: 4        # Or add 4 pods
        periodSeconds: 15
      selectPolicy: Max # Use whichever adds more pods

    # Scale-down behavior: Conservative for stability
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute window
      policies:
      - type: Percent
        value: 10       # Remove at most 10% of replicas
        periodSeconds: 60
      selectPolicy: Max
```

Understanding the HPA Algorithm:
When multiple metrics are specified, HPA calculates the desired replica count for each metric independently, then takes the maximum. This ensures the deployment can handle whichever constraint is most demanding.
Calculation Example:

Suppose 4 replicas are running at 90% average CPU utilization against a 70% target:

desiredReplicas = ceil(4 × (90 / 70)) = ceil(5.14) = 6

HPA scales the deployment to 6 replicas. With multiple metrics, this calculation runs once per metric and, per the rule above, the largest result wins.
Critical HPA Prerequisites:
- Metrics Server installed: resource metrics (CPU/memory) come from the Metrics Server. Install it with:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

- Resource requests defined: HPA computes utilization as current / request. Without requests, utilization is undefined and HPA won't scale on resources.

The behavior section in HPA (introduced in autoscaling/v2) gives fine-grained control over how quickly scaling happens. This is crucial for preventing thrashing (rapid scale-up followed by immediate scale-down) while maintaining responsiveness.
Stabilization Windows:
The stabilization window is a lookback period where HPA considers all calculated replica values and chooses the highest (for scale-up stability) or lowest (for scale-down stability).
Default stabilization:

- Scale-up: 0 seconds (no stabilization; react as soon as metrics warrant)
- Scale-down: 300 seconds (act on the highest desired-replica value seen in the last 5 minutes)
This asymmetry reflects operational reality: scaling up should be fast to handle load, but scaling down should be cautious to avoid premature capacity removal.
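Expressed explicitly, these defaults are equivalent to the following behavior block:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # React to rising load immediately
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before removing capacity
```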
Scaling Policies:
Policies define how much can be scaled in a given period. You can specify multiple policies and use selectPolicy to choose between them.
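selectPolicy accepts Max, Min, or Disabled. As a small sketch, Min picks whichever matching policy allows the smallest change, which makes scale-down maximally conservative:

```yaml
scaleDown:
  policies:
  - type: Percent
    value: 10
    periodSeconds: 60
  - type: Pods
    value: 5
    periodSeconds: 60
  selectPolicy: Min  # Remove whichever is fewer: 10% of replicas or 5 pods
```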
```yaml
# Pattern 1: Aggressive scale-up, very conservative scale-down
# Use case: Customer-facing services where availability is critical
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 200        # Allow tripling
      periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 600  # 10-minute window
    policies:
    - type: Pods
      value: 1          # Only remove 1 pod
      periodSeconds: 120  # Every 2 minutes
    selectPolicy: Max
---
# Pattern 2: Balanced scaling for predictable workloads
# Use case: Internal services with predictable traffic
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60  # Wait 1 minute before scaling up
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 180  # 3-minute window
    policies:
    - type: Percent
      value: 25         # Remove at most 25%
      periodSeconds: 60
---
# Pattern 3: Disable scale-down entirely (scale-up only)
# Use case: Pre-scaling before known events, manual scale-down
behavior:
  scaleDown:
    selectPolicy: Disabled
---
# Pattern 4: Rapid scaling for batch processors
# Use case: Queue workers that should scale fast in both directions
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 10
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 30
    policies:
    - type: Percent
      value: 50
      periodSeconds: 30
```

Without proper stabilization, HPA can thrash: a traffic spike causes scale-up, new pods absorb load, utilization drops, HPA scales down, remaining pods become overloaded, HPA scales up again. The 5-minute default scale-down window prevents this, but if you reduce it, monitor for oscillation patterns.
While CPU and memory are useful, many workloads need to scale on business metrics: requests per second, queue depth, active connections, or custom application metrics. This requires a custom metrics pipeline.
Custom Metrics Architecture:

The typical pipeline: the application exposes metrics in Prometheus format → Prometheus scrapes and stores them → the Prometheus Adapter translates configured queries into the custom.metrics.k8s.io and external.metrics.k8s.io aggregated APIs → the HPA controller consumes those APIs exactly as it does the resource metrics API.
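Once the adapter is running, the pipeline can be verified by querying the aggregated API directly, the same way HPA does (the metric and namespace names follow the examples on this page):

```bash
# List all custom metrics the adapter currently exposes
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Query a specific per-pod metric, as HPA would
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
```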
Metric Types:
- Pods metrics (type: Pods): Metric value per pod (e.g., requests_per_second averaged across pods)
- Object metrics (type: Object): Metric from a specific Kubernetes object (e.g., Ingress request count)
- External metrics (type: External): Metric from an external system (e.g., cloud queue depth)
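The Pods and External types appear in the manifests below; the Object type, not otherwise shown on this page, looks like this sketch targeting a hypothetical Ingress named main-ingress:

```yaml
# Slots into an HPA's spec.metrics list
metrics:
- type: Object
  object:
    metric:
      name: requests_per_second
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: main-ingress   # hypothetical Ingress name
    target:
      type: Value
      value: "10k"
```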
```yaml
# Prometheus Adapter configuration for custom metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Rule 1: HTTP requests per second per pod
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

    # Rule 2: Queue depth from external queue
    - seriesQuery: 'sqs_queue_messages_visible{queue_name!=""}'
      resources:
        template: "<<.Resource>>"
      name:
        matches: "^sqs_queue_messages_visible$"
        as: "sqs_queue_depth"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (queue_name)'

    # Rule 3: Active WebSocket connections
    - seriesQuery: 'websocket_active_connections{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "active_connections"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
  # Scale on requests per second per pod
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # Each pod handles 500 RPS

  # Scale on p99 latency (scale up if latency too high)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: "200m"  # Target p99 < 200ms

  # Scale on external queue depth
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: api-requests
      target:
        type: AverageValue
        averageValue: "10"    # 10 messages per pod
```

The best scaling metric is one that directly reflects user-facing impact. For web services, requests-per-second is often better than CPU because it captures actual demand. For queue workers, queue depth is ideal. For latency-sensitive services, consider scaling on p95/p99 latency: if latency rises, add capacity.
The Vertical Pod Autoscaler (VPA) automatically adjusts CPU and memory requests/limits for containers. Unlike HPA which adds replicas, VPA resizes existing pods—often requiring restarts.
VPA Components:

- Recommender: watches resource usage history and produces recommendations
- Updater: evicts pods whose current requests deviate too far from the recommendation
- Admission Controller: mutates pod requests at creation time to match the recommendation
VPA Update Modes:

- Off: compute recommendations only; never apply them
- Initial: apply recommendations only when pods are created
- Recreate: evict running pods to apply recommendations
- Auto: currently equivalent to Recreate (will use in-place updates when available)
When to Use VPA:

- Workloads whose resource needs are unknown or drift over time
- Singleton or stateful workloads that cannot scale horizontally
- As a recommendation engine (Off mode) to guide manual right-sizing
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-service-vpa
  namespace: production
spec:
  # Target workload
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-service

  # Update policy
  updatePolicy:
    updateMode: "Auto"   # Automatically apply recommendations
    minReplicas: 2       # Don't disrupt if only 1 replica running

  # Resource policy (constraints on recommendations)
  resourcePolicy:
    containerPolicies:
    - containerName: main-app
      # Mode for this container
      mode: "Auto"
      # Minimum resources (never recommend less than this)
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      # Maximum resources (never recommend more than this)
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      # Which resources VPA can modify
      controlledResources: ["cpu", "memory"]
      # Control values (can set for resource-specific control)
      controlledValues: RequestsAndLimits
    - containerName: sidecar
      # Don't let VPA touch the sidecar
      mode: "Off"
---
# VPA in recommendation-only mode (safe starting point)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Only recommend, don't apply
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "50m"
        memory: "64Mi"
      maxAllowed:
        cpu: "8"
        memory: "16Gi"
```

Viewing VPA Recommendations:
```bash
# Get VPA status and recommendations
kubectl describe vpa backend-service-vpa

# Output shows:
# - Target: The recommendation VPA will apply (or report, in Off mode)
# - Lower Bound: Minimum recommended
# - Upper Bound: Maximum recommended
# - Uncapped Target: Recommendation ignoring minAllowed/maxAllowed
```
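Recommendations can also be pulled programmatically, e.g. for dashboards or scripted right-sizing (field paths per the VPA v1 status API):

```bash
# Extract just the target recommendation for the first container
kubectl get vpa backend-service-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
```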
VPA Recommendation Types:

- Target: the value VPA applies (or recommends in Off mode)
- Lower Bound: below this, the workload is likely starved and eviction is justified
- Upper Bound: above this, resources are likely wasted
- Uncapped Target: what VPA would recommend without minAllowed/maxAllowed constraints
VPA cannot be used with HPA on the same metrics. If HPA scales on CPU and VPA adjusts CPU requests, they fight each other. The solution: use HPA for scaling replicas on custom metrics (RPS, queue depth), and VPA for resource right-sizing on CPU/memory. Or use VPA in 'Off' mode just for recommendations.
Using HPA and VPA together requires careful coordination to avoid conflicts. The fundamental issue: HPA uses resource utilization (current/request) for scaling decisions. If VPA changes requests, it affects utilization calculations.
Pattern 1: HPA on Custom Metrics, VPA on Resources
The safest approach: configure HPA to scale on custom metrics only (RPS, queue depth, latency) while VPA manages resource requests. Since HPA doesn't use resource utilization, there's no conflict.
Pattern 2: VPA Recommendations Only
Run VPA in Off mode to get recommendations without automatic application. Use recommendations to manually tune resource requests during planned maintenance windows.
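Applying a recommendation during such a window can be as simple as patching the requests; a sketch, where the quantities stand in for whatever the recommender reported:

```bash
# Apply VPA's recommended requests to the deployment manually
kubectl set resources deployment my-app \
  --requests=cpu=450m,memory=800Mi \
  --limits=memory=1600Mi
```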
Pattern 3: Multidimensional Pod Autoscaler (Beta)
The Kubernetes project is developing a Multidimensional Pod Autoscaler (MPA) that unifies HPA and VPA decision-making. Until it's stable, use the patterns below.
```yaml
# Pattern 1: HPA on custom metrics, VPA on resources
# This is the recommended production pattern

# HPA: scales replicas based on requests per second
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
  # Only use custom metrics - NOT cpu/memory
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
---
# VPA: manages CPU and memory for optimal sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: api-container
      minAllowed:
        cpu: "200m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
```

| HPA Metric | VPA Mode | Compatible? | Risk |
|---|---|---|---|
| CPU/Memory | Auto/Recreate | ❌ No | Scaling conflicts |
| CPU/Memory | Off | ✅ Yes | Manual tuning needed |
| Custom (RPS) | Auto/Recreate | ✅ Yes | None |
| Custom (RPS) | Off | ✅ Yes | None, safest option |
| Mixed (CPU + Custom) | Auto | ⚠️ Risky | Partial conflicts |
The Kubernetes autoscaling SIG is developing the Multidimensional Pod Autoscaler (MPA) to solve HPA+VPA coordination. MPA will provide a single controller that makes unified scaling decisions across replica count and resource requests. Monitor KEP-2353 for progress.
HPA and VPA scale pods, but what if cluster capacity is exhausted? The Cluster Autoscaler (CA) adds and removes nodes from cloud provider node groups based on pending pods and underutilization.
How Cluster Autoscaler Works:
Scale-Up:
- Unschedulable pods enter the Pending state with the Unschedulable condition
- CA simulates scheduling the pending pods onto a hypothetical new node from each configured node group
- If a node group's node would fit them, CA requests a new node from the cloud provider (choosing between groups via the expander, e.g. least-waste)

Scale-Down:

- CA flags nodes whose requested resources fall below the utilization threshold (default 50%)
- It checks that every pod on the node can be safely rescheduled elsewhere
- After a node stays unneeded for the configured time (default 10 minutes), CA drains and removes it
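When scale-up doesn't happen as expected, CA records its reasoning as events on the pending pods and in a status ConfigMap:

```bash
# CA explains its decisions via events on the pending pods
# (look for TriggeredScaleUp / NotTriggerScaleUp in the Events section)
kubectl describe pod <pending-pod> -n <namespace>

# The autoscaler also maintains a human-readable status ConfigMap
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
```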
Scale-Down Blockers:

- Pods with restrictive PodDisruptionBudgets
- Pods not managed by a controller (bare pods)
- Pods using local storage (unless --skip-nodes-with-local-storage=false)
- Pods annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
- kube-system pods without a PDB
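These blockers can be set deliberately per workload. A brief sketch (the workload names are illustrative) showing the safe-to-evict annotation and a PDB that limits voluntary evictions during node drain:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-worker   # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cache-worker
  template:
    metadata:
      labels:
        app: cache-worker
      annotations:
        # Tell Cluster Autoscaler never to evict these pods during scale-down
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: worker
        image: example.com/cache-worker:latest
---
# A PodDisruptionBudget also constrains evictions during node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cache-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cache-worker
```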
```yaml
# Cluster Autoscaler deployment (AWS EKS example)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      priorityClassName: system-cluster-critical
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste  # Node selection strategy
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups          # Balance across AZs
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m       # Wait 10m after scale-up
        - --scale-down-delay-after-delete=0s
        - --scale-down-unneeded-time=10m         # Must be unneeded for 10m
        - --scale-down-utilization-threshold=0.5 # Below 50% = underutilized
        - --max-node-provision-time=15m
        - --max-graceful-termination-sec=600
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
```

Karpenter is an open-source, high-performance Kubernetes cluster autoscaler designed by AWS. Unlike Cluster Autoscaler, which relies on predefined node groups, Karpenter provisions nodes on demand with the exact specifications needed for pending pods.
Key Differences from Cluster Autoscaler:
| Cluster Autoscaler | Karpenter |
|---|---|
| Scales predefined node groups | Provisions nodes based on pod requirements |
| Limited instance type flexibility | Chooses from all compatible instance types |
| Slower (waits for ASG scaling) | Faster (direct EC2 API calls) |
| Simpler setup | More powerful but more configuration |
How Karpenter Works:

- Watches for unschedulable pods, just as Cluster Autoscaler does
- Computes the instance type(s) that best satisfy the pods' aggregate requirements
- Calls the cloud API directly to launch nodes (no node group or ASG in between)
- Continuously consolidates, replacing or removing nodes when pods fit on fewer or cheaper ones
Karpenter enables:

- Sub-minute node provisioning for pending pods
- Automatic diversification across instance types and purchase options (spot/on-demand)
- Tighter bin-packing and ongoing consolidation to reduce cost
```yaml
# Karpenter NodePool (v1beta1 API)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  # Template for nodes provisioned by this pool
  template:
    spec:
      # Requirements for node selection
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]  # Compute, general, memory-optimized
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values: ["medium", "large", "xlarge", "2xlarge"]
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-west-2a", "us-west-2b", "us-west-2c"]
      # Node configuration
      nodeClassRef:
        name: default
      # Taints applied to nodes (optional)
      taints:
      - key: example.com/special-hardware
        effect: NoSchedule
        value: "true"
  # Limits on total resources this pool can provision
  limits:
    cpu: 1000
    memory: 2000Gi
  # Disruption settings
  disruption:
    # Consolidation: pack pods onto fewer nodes when possible
    consolidationPolicy: WhenUnderutilized  # or WhenEmpty
    consolidateAfter: 30s
    # Budget for disruptions
    budgets:
    - nodes: "10%"  # Allow disrupting 10% of nodes at a time
---
# EC2NodeClass: AWS-specific node configuration
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # Amazon Linux 2
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  # Block device configuration
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      encrypted: true
```

Use Karpenter if: you're on AWS (or Azure preview), need fast scaling, want optimal bin-packing, or use mixed instance types. Use Cluster Autoscaler if: you're on non-AWS clouds, need simpler configuration, have stable, predictable node requirements, or use GKE Autopilot (built-in scaling).
Auto-scaling failures often stem from configuration mistakes rather than autoscaler bugs. Here are the most common anti-patterns and their solutions:
Anti-Pattern 1: Missing Resource Requests
HPA calculates utilization as current / request. Without requests, utilization is undefined. Pods show 0% or unknown utilization, and HPA cannot scale.
Solution: Always set resource requests. Use VPA in 'Off' mode to get recommendations if unsure.
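For reference, a minimal Deployment with the requests HPA needs; the quantities are placeholders to tune per workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
      - name: app
        image: example.com/web-service:latest
        resources:
          requests:
            cpu: "250m"      # HPA computes utilization against this value
            memory: "256Mi"
          limits:
            memory: "512Mi"
```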
Anti-Pattern 2: maxReplicas Too Low
HPA hits maxReplicas during traffic spike. Even though utilization is high, no more pods are added. Users experience degradation.
Solution: Set maxReplicas with headroom for unexpected spikes. Consider 2-3x peak expected replicas.
Anti-Pattern 3: Slow Pod Startup
HPA adds pods, but they take 60+ seconds to start serving traffic. During this window, existing pods remain overloaded, triggering more scale-up. Result: over-provisioning followed by mass scale-down.
Solution: Optimize container startup time. Use proper readiness probes. Consider pod priority and preemption for faster scheduling.
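As a sketch of the probe side of that fix (paths and port are assumptions), a startupProbe tolerates slow boots without delaying the steady-state readiness signal, so new replicas receive traffic as soon as they are genuinely ready:

```yaml
# Slots into the pod template's containers list
containers:
- name: app
  image: example.com/web-service:latest
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30  # Allow up to 30 x 2s = 60s for slow startup
    periodSeconds: 2
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```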
```bash
#!/bin/bash
# Troubleshooting HPA issues

# Check HPA status and events
kubectl describe hpa <hpa-name> -n <namespace>

# Key things to look for:
# - Current/Desired replicas
# - Current metrics vs targets
# - Conditions (ScalingActive, AbleToScale, ScalingLimited)
# - Events (scaling decisions, errors)

# Check if metrics server is working
kubectl top pods -n <namespace>
kubectl top nodes

# If metrics unavailable, check metrics-server
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server

# Check custom metrics API
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Check external metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .

# Debug Cluster Autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

# Check for pending pods (indicates CA should scale up)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Check node utilization (CA scale-down decisions)
kubectl describe nodes | grep -A5 "Allocated resources"
```

Effective auto-scaling is essential for running efficient, responsive Kubernetes workloads. Let's consolidate the key principles:

- Set resource requests on every container; HPA's utilization math depends on them
- Use autoscaling/v2 behaviors: scale up fast, scale down slowly
- Prefer metrics that reflect real demand (RPS, queue depth, latency) over raw CPU where a custom metrics pipeline exists
- Never let HPA and VPA act on the same resource metrics; split responsibilities or run VPA in Off mode
- Give maxReplicas headroom (2-3x expected peak) and ensure node capacity can follow via Cluster Autoscaler or Karpenter
What's Next:
With auto-scaling in place, your deployments can respond dynamically to load. The next page covers Rolling Updates and Rollbacks—strategies for safely deploying changes to running applications without downtime, and quickly reverting when things go wrong.
You now understand Kubernetes auto-scaling in depth—from HPA and VPA fundamentals through custom metrics, Cluster Autoscaler, Karpenter, and production patterns for combining autoscalers effectively. Apply these principles to build self-adjusting infrastructure that balances cost and performance.