One of Kubernetes' most powerful capabilities is auto-scaling—the ability to automatically adjust compute resources in response to changing demand. Done right, auto-scaling delivers the dream of elastic infrastructure: your application seamlessly handles traffic spikes without manual intervention, then scales down during quiet periods to minimize costs.
But auto-scaling is not magic. Poorly configured autoscalers can cause oscillation (rapid scale up/down cycles), fail to respond quickly enough to traffic spikes, or scale too aggressively and waste resources. Mastering auto-scaling requires understanding the different scaling dimensions, the algorithms that drive them, and the intricate interplay between resource configuration and scaling behavior.
Kubernetes provides three complementary scaling mechanisms:

- Horizontal Pod Autoscaler (HPA): adjusts the number of pod replicas
- Vertical Pod Autoscaler (VPA): adjusts the CPU/memory allocated to each pod
- Cluster Autoscaler (CA): adjusts the number of nodes in the cluster
This page provides a deep, production-focused exploration of each mechanism.
By the end of this page, you'll understand how to configure HPA with built-in and custom metrics, implement VPA for automatic resource right-sizing, coordinate between HPA and VPA, integrate with Cluster Autoscaler, and design scaling strategies that balance responsiveness with stability.
Before diving into specific autoscalers, it's essential to understand the different dimensions along which Kubernetes can scale.
Horizontal Scaling (Scale Out/In)
Adds or removes pod replicas while keeping each pod's resource allocation constant. This is the most common scaling approach for stateless services.
Advantages:

- No restarts: existing pods keep serving traffic while replicas are added or removed
- Near-linear capacity gains for stateless services
- More replicas also improve fault tolerance

Limitations:

- Requires the workload to tolerate multiple replicas (stateless, or state kept externally)
- Cannot help a single pod that is under-resourced for its baseline work
- Every replica carries fixed overhead (sidecars, connection pools, caches)
Vertical Scaling (Scale Up/Down)
Increases or decreases the CPU/memory allocated to each pod. Requires pod restart to apply (in most cases).
Advantages:

- Works for workloads that cannot be replicated (singletons, many stateful apps)
- Corrects under- or over-provisioned pods without architectural changes

Limitations:

- Applying new values usually requires a pod restart
- A pod can never grow beyond the capacity of the node it runs on
- Reacts more slowly than adding replicas
Cluster Scaling (Node Addition/Removal)
Adds or removes nodes from the cluster. Necessary when existing nodes are fully utilized.
| Dimension | What Changes | Speed | Disruption | Limit |
|---|---|---|---|---|
| Horizontal (HPA) | Number of pods | Fast (seconds) | None to existing pods | Cluster capacity |
| Vertical (VPA) | Pod CPU/memory | Slow (minutes) | Pod restart required | Node capacity |
| Cluster (CA) | Number of nodes | Slow (minutes) | Pod eviction on node removal | Cloud/budget limit |
The most robust scaling strategy combines all three dimensions: HPA responds quickly to load changes, VPA right-sizes pods during off-peak hours, and Cluster Autoscaler ensures nodes exist to accommodate the workload. Getting these to work together harmoniously is the key challenge.
The Horizontal Pod Autoscaler (HPA) is Kubernetes' most commonly used autoscaler. It automatically scales the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.
How HPA Works:
Metric Collection: HPA controller queries the Metrics Server (or custom metrics adapter) at regular intervals (default: 15 seconds)
Desired Replica Calculation: Using the target metric value, HPA calculates the desired replica count:
desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))
Stabilization: HPA applies stabilization windows to prevent oscillation (scale-up and scale-down have different windows)
Scaling Action: If the desired replica count differs from the current one, HPA patches the target's spec.replicas
HPA API Versions:
- autoscaling/v1: Basic CPU-only scaling
- autoscaling/v2: Full-featured with memory, custom metrics, and behaviors (use this)
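For quick experiments, kubectl can also create a basic CPU-based HPA imperatively before you commit to a full v2 manifest (the deployment name and thresholds here are illustrative):

```bash
# Create a simple CPU-utilization HPA without writing YAML
kubectl autoscale deployment web-service --cpu-percent=70 --min=3 --max=50

# Watch scaling decisions as they happen
kubectl get hpa web-service -w
```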
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
  namespace: production
spec:
  # Target workload to scale
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service

  # Replica bounds
  minReplicas: 3    # Never go below 3 replicas
  maxReplicas: 50   # Never exceed 50 replicas

  # Metrics to scale on (multiple metrics combined)
  metrics:
  # Metric 1: CPU utilization (most common)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Target 70% CPU utilization

  # Metric 2: Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Target 80% memory utilization

  # Metric 3: Custom metric (requests per second per pod)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"  # Target 1000 RPS per pod

  # Metric 4: External metric (queue depth)
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: "100"  # Scale when queue > 100 messages

  # Scaling behavior (v2 feature)
  behavior:
    # Scale-up behavior: Aggressive for responsiveness
    scaleUp:
      stabilizationWindowSeconds: 0  # Immediate scale-up
      policies:
      - type: Percent
        value: 100      # Allow doubling replicas
        periodSeconds: 15
      - type: Pods
        value: 4        # Or add 4 pods
        periodSeconds: 15
      selectPolicy: Max # Use whichever adds more pods

    # Scale-down behavior: Conservative for stability
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute window
      policies:
      - type: Percent
        value: 10       # Remove at most 10% of replicas
        periodSeconds: 60
      selectPolicy: Max
```

Understanding the HPA Algorithm:
When multiple metrics are specified, HPA calculates the desired replica count for each metric independently, then takes the maximum. This ensures the deployment can handle whichever constraint is most demanding.
Calculation Example:

Suppose 4 replicas are running at 90% average CPU utilization against a 70% target:

desiredReplicas = ceil(4 × (90 / 70)) = ceil(5.14) = 6

HPA scales the deployment to 6 replicas. With multiple metrics, this calculation runs once per metric and, per the rule above, the largest result wins.
Critical HPA Prerequisites:
- Metrics Server installed: resource metrics (CPU/memory) come from the Metrics Server. Install it with:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

- Resource requests defined: HPA computes utilization as current / request. Without requests, utilization is undefined and HPA won't scale on resources.

The behavior section in HPA (introduced in autoscaling/v2) gives fine-grained control over how quickly scaling happens. This is crucial for preventing thrashing (rapid scale-up followed by immediate scale-down) while maintaining responsiveness.
Stabilization Windows:
The stabilization window is a lookback period where HPA considers all calculated replica values and chooses the highest (for scale-up stability) or lowest (for scale-down stability).
Default stabilization:

- Scale-up: 0 seconds (no stabilization; react as soon as metrics warrant)
- Scale-down: 300 seconds (act on the highest desired-replica value seen in the last 5 minutes)
This asymmetry reflects operational reality: scaling up should be fast to handle load, but scaling down should be cautious to avoid premature capacity removal.
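Expressed explicitly, these defaults are equivalent to the following behavior block:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # React to rising load immediately
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before removing capacity
```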
Scaling Policies:
Policies define how much can be scaled in a given period. You can specify multiple policies and use selectPolicy to choose between them.
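selectPolicy accepts Max, Min, or Disabled. As a small sketch, Min picks whichever matching policy allows the smallest change, which makes scale-down maximally conservative:

```yaml
scaleDown:
  policies:
  - type: Percent
    value: 10
    periodSeconds: 60
  - type: Pods
    value: 5
    periodSeconds: 60
  selectPolicy: Min  # Remove whichever is fewer: 10% of replicas or 5 pods
```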
```yaml
# Pattern 1: Aggressive scale-up, very conservative scale-down
# Use case: Customer-facing services where availability is critical
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 200        # Allow tripling
      periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 600  # 10-minute window
    policies:
    - type: Pods
      value: 1          # Only remove 1 pod
      periodSeconds: 120  # Every 2 minutes
    selectPolicy: Max
---
# Pattern 2: Balanced scaling for predictable workloads
# Use case: Internal services with predictable traffic
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60  # Wait 1 minute before scaling up
    policies:
    - type: Pods
      value: 2
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 180  # 3-minute window
    policies:
    - type: Percent
      value: 25         # Remove at most 25%
      periodSeconds: 60
---
# Pattern 3: Disable scale-down entirely (scale-up only)
# Use case: Pre-scaling before known events, manual scale-down
behavior:
  scaleDown:
    selectPolicy: Disabled
---
# Pattern 4: Rapid scaling for batch processors
# Use case: Queue workers that should scale fast in both directions
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 10
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 30
    policies:
    - type: Percent
      value: 50
      periodSeconds: 30
```

Without proper stabilization, HPA can thrash: a traffic spike causes scale-up, new pods absorb load, utilization drops, HPA scales down, remaining pods become overloaded, HPA scales up again. The 5-minute default scale-down window prevents this, but if you reduce it, monitor for oscillation patterns.
While CPU and memory are useful, many workloads need to scale on business metrics: requests per second, queue depth, active connections, or custom application metrics. This requires a custom metrics pipeline.
Custom Metrics Architecture:

The typical pipeline: the application exposes metrics in Prometheus format → Prometheus scrapes and stores them → the Prometheus Adapter translates configured queries into the custom.metrics.k8s.io and external.metrics.k8s.io aggregated APIs → the HPA controller consumes those APIs exactly as it does the resource metrics API.
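Once the adapter is running, the pipeline can be verified by querying the aggregated API directly, the same way HPA does (the metric and namespace names follow the examples on this page):

```bash
# List all custom metrics the adapter currently exposes
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Query a specific per-pod metric, as HPA would
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
```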
Metric Types:
- Pods metrics (type: Pods): Metric value per pod (e.g., requests_per_second averaged across pods)
- Object metrics (type: Object): Metric from a specific Kubernetes object (e.g., Ingress request count)
- External metrics (type: External): Metric from an external system (e.g., cloud queue depth)
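The Pods and External types appear in the manifests below; the Object type, not otherwise shown on this page, looks like this sketch targeting a hypothetical Ingress named main-ingress:

```yaml
# Slots into an HPA's spec.metrics list
metrics:
- type: Object
  object:
    metric:
      name: requests_per_second
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: main-ingress   # hypothetical Ingress name
    target:
      type: Value
      value: "10k"
```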
```yaml
# Prometheus Adapter configuration for custom metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Rule 1: HTTP requests per second per pod
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

    # Rule 2: Queue depth from external queue
    - seriesQuery: 'sqs_queue_messages_visible{queue_name!=""}'
      resources:
        template: "<<.Resource>>"
      name:
        matches: "^sqs_queue_messages_visible$"
        as: "sqs_queue_depth"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (queue_name)'

    # Rule 3: Active WebSocket connections
    - seriesQuery: 'websocket_active_connections{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "active_connections"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
  # Scale on requests per second per pod
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # Each pod handles 500 RPS

  # Scale on p99 latency (scale up if latency too high)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: "200m"  # Target p99 < 200ms

  # Scale on external queue depth
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: api-requests
      target:
        type: AverageValue
        averageValue: "10"    # 10 messages per pod
```

The best scaling metric is one that directly reflects user-facing impact. For web services, requests-per-second is often better than CPU because it captures actual demand. For queue workers, queue depth is ideal. For latency-sensitive services, consider scaling on p95/p99 latency: if latency rises, add capacity.
The Vertical Pod Autoscaler (VPA) automatically adjusts CPU and memory requests/limits for containers. Unlike HPA which adds replicas, VPA resizes existing pods—often requiring restarts.
VPA Components:

- Recommender: watches resource usage history and produces recommendations
- Updater: evicts pods whose current requests deviate too far from the recommendation
- Admission Controller: mutates pod requests at creation time to match the recommendation
VPA Update Modes:

- Off: compute recommendations only; never apply them
- Initial: apply recommendations only when pods are created
- Recreate: evict running pods to apply recommendations
- Auto: currently equivalent to Recreate (will use in-place updates when available)
When to Use VPA:

- Workloads whose resource needs are unknown or drift over time
- Singleton or stateful workloads that cannot scale horizontally
- As a recommendation engine (Off mode) to guide manual right-sizing
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-service-vpa
  namespace: production
spec:
  # Target workload
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-service

  # Update policy
  updatePolicy:
    updateMode: "Auto"   # Automatically apply recommendations
    minReplicas: 2       # Don't disrupt if only 1 replica running

  # Resource policy (constraints on recommendations)
  resourcePolicy:
    containerPolicies:
    - containerName: main-app
      # Mode for this container
      mode: "Auto"
      # Minimum resources (never recommend less than this)
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      # Maximum resources (never recommend more than this)
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      # Which resources VPA can modify
      controlledResources: ["cpu", "memory"]
      # Control values (can set for resource-specific control)
      controlledValues: RequestsAndLimits
    - containerName: sidecar
      # Don't let VPA touch the sidecar
      mode: "Off"
---
# VPA in recommendation-only mode (safe starting point)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Only recommend, don't apply
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "50m"
        memory: "64Mi"
      maxAllowed:
        cpu: "8"
        memory: "16Gi"
```

Viewing VPA Recommendations:
```bash
# Get VPA status and recommendations
kubectl describe vpa backend-service-vpa

# Output shows:
# - Target: The recommendation VPA will apply (or report, in Off mode)
# - Lower Bound: Minimum recommended
# - Upper Bound: Maximum recommended
# - Uncapped Target: Recommendation ignoring minAllowed/maxAllowed
```
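Recommendations can also be pulled programmatically, e.g. for dashboards or scripted right-sizing (field paths per the VPA v1 status API):

```bash
# Extract just the target recommendation for the first container
kubectl get vpa backend-service-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
```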
VPA Recommendation Types:

- Target: the value VPA applies (or recommends in Off mode)
- Lower Bound: below this, the workload is likely starved and eviction is justified
- Upper Bound: above this, resources are likely wasted
- Uncapped Target: what VPA would recommend without minAllowed/maxAllowed constraints
VPA cannot be used with HPA on the same metrics. If HPA scales on CPU and VPA adjusts CPU requests, they fight each other. The solution: use HPA for scaling replicas on custom metrics (RPS, queue depth), and VPA for resource right-sizing on CPU/memory. Or use VPA in 'Off' mode just for recommendations.
Using HPA and VPA together requires careful coordination to avoid conflicts. The fundamental issue: HPA uses resource utilization (current/request) for scaling decisions. If VPA changes requests, it affects utilization calculations.
Pattern 1: HPA on Custom Metrics, VPA on Resources
The safest approach: configure HPA to scale on custom metrics only (RPS, queue depth, latency) while VPA manages resource requests. Since HPA doesn't use resource utilization, there's no conflict.
Pattern 2: VPA Recommendations Only
Run VPA in Off mode to get recommendations without automatic application. Use recommendations to manually tune resource requests during planned maintenance windows.
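Applying a recommendation during such a window can be as simple as patching the requests; a sketch, where the quantities stand in for whatever the recommender reported:

```bash
# Apply VPA's recommended requests to the deployment manually
kubectl set resources deployment my-app \
  --requests=cpu=450m,memory=800Mi \
  --limits=memory=1600Mi
```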
Pattern 3: Multidimensional Pod Autoscaler (Beta)
The Kubernetes project is developing a Multidimensional Pod Autoscaler (MPA) that unifies HPA and VPA decision-making. Until it's stable, use the patterns below.
```yaml
# Pattern 1: HPA on custom metrics, VPA on resources
# This is the recommended production pattern

# HPA: scales replicas based on requests per second
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
  # Only use custom metrics - NOT cpu/memory
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
---
# VPA: manages CPU and memory for optimal sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: api-container
      minAllowed:
        cpu: "200m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
```

| HPA Metric | VPA Mode | Compatible? | Risk |
|---|---|---|---|
| CPU/Memory | Auto/Recreate | ❌ No | Scaling conflicts |
| CPU/Memory | Off | ✅ Yes | Manual tuning needed |
| Custom (RPS) | Auto/Recreate | ✅ Yes | None |
| Custom (RPS) | Off | ✅ Yes | None, safest option |
| Mixed (CPU + Custom) | Auto | ⚠️ Risky | Partial conflicts |
The Kubernetes autoscaling SIG is developing the Multidimensional Pod Autoscaler (MPA) to solve HPA+VPA coordination. MPA will provide a single controller that makes unified scaling decisions across replica count and resource requests. Monitor KEP-2353 for progress.
HPA and VPA scale pods, but what if cluster capacity is exhausted? The Cluster Autoscaler (CA) adds and removes nodes from cloud provider node groups based on pending pods and underutilization.
How Cluster Autoscaler Works:
Scale-Up:
- Unschedulable pods enter the Pending state with the Unschedulable condition
- CA simulates scheduling the pending pods onto a hypothetical new node from each configured node group
- If a node group's node would fit them, CA requests a new node from the cloud provider (choosing between groups via the expander, e.g. least-waste)

Scale-Down:

- CA flags nodes whose requested resources fall below the utilization threshold (default 50%)
- It checks that every pod on the node can be safely rescheduled elsewhere
- After a node stays unneeded for the configured time (default 10 minutes), CA drains and removes it
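When scale-up doesn't happen as expected, CA records its reasoning as events on the pending pods and in a status ConfigMap:

```bash
# CA explains its decisions via events on the pending pods
# (look for TriggeredScaleUp / NotTriggerScaleUp in the Events section)
kubectl describe pod <pending-pod> -n <namespace>

# The autoscaler also maintains a human-readable status ConfigMap
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
```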
Scale-Down Blockers:

- Pods with restrictive PodDisruptionBudgets
- Pods not managed by a controller (bare pods)
- Pods using local storage (unless --skip-nodes-with-local-storage=false)
- Pods annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
- kube-system pods without a PDB
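These blockers can be set deliberately per workload. A brief sketch (the workload names are illustrative) showing the safe-to-evict annotation and a PDB that limits voluntary evictions during node drain:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-worker   # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cache-worker
  template:
    metadata:
      labels:
        app: cache-worker
      annotations:
        # Tell Cluster Autoscaler never to evict these pods during scale-down
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
      - name: worker
        image: example.com/cache-worker:latest
---
# A PodDisruptionBudget also constrains evictions during node drain
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cache-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cache-worker
```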
```yaml
# Cluster Autoscaler deployment (AWS EKS example)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      priorityClassName: system-cluster-critical
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste  # Node selection strategy
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups          # Balance across AZs
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m       # Wait 10m after scale-up
        - --scale-down-delay-after-delete=0s
        - --scale-down-unneeded-time=10m         # Must be unneeded for 10m
        - --scale-down-utilization-threshold=0.5 # Below 50% = underutilized
        - --max-node-provision-time=15m
        - --max-graceful-termination-sec=600
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
```

Karpenter is an open-source, high-performance Kubernetes cluster autoscaler designed by AWS. Unlike Cluster Autoscaler, which relies on predefined node groups, Karpenter provisions nodes on demand with the exact specifications needed for pending pods.
Key Differences from Cluster Autoscaler:
| Cluster Autoscaler | Karpenter |
|---|---|
| Scales predefined node groups | Provisions nodes based on pod requirements |
| Limited instance type flexibility | Chooses from all compatible instance types |
| Slower (waits for ASG scaling) | Faster (direct EC2 API calls) |
| Simpler setup | More powerful but more configuration |
How Karpenter Works:

- Watches for unschedulable pods, just as Cluster Autoscaler does
- Computes the instance type(s) that best satisfy the pods' aggregate requirements
- Calls the cloud API directly to launch nodes (no node group or ASG in between)
- Continuously consolidates, replacing or removing nodes when pods fit on fewer or cheaper ones
Karpenter enables:

- Sub-minute node provisioning for pending pods
- Automatic diversification across instance types and purchase options (spot/on-demand)
- Tighter bin-packing and ongoing consolidation to reduce cost
```yaml
# Karpenter NodePool (v1beta1 API)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  # Template for nodes provisioned by this pool
  template:
    spec:
      # Requirements for node selection
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]  # Compute, general, memory-optimized
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values: ["medium", "large", "xlarge", "2xlarge"]
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-west-2a", "us-west-2b", "us-west-2c"]
      # Node configuration
      nodeClassRef:
        name: default
      # Taints applied to nodes (optional)
      taints:
      - key: example.com/special-hardware
        effect: NoSchedule
        value: "true"
  # Limits on total resources this pool can provision
  limits:
    cpu: 1000
    memory: 2000Gi
  # Disruption settings
  disruption:
    # Consolidation: pack pods onto fewer nodes when possible
    consolidationPolicy: WhenUnderutilized  # or WhenEmpty
    consolidateAfter: 30s
    # Budget for disruptions
    budgets:
    - nodes: "10%"  # Allow disrupting 10% of nodes at a time
---
# EC2NodeClass: AWS-specific node configuration
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # Amazon Linux 2
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  # Block device configuration
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      encrypted: true
```

Use Karpenter if: you're on AWS (or Azure preview), need fast scaling, want optimal bin-packing, or use mixed instance types. Use Cluster Autoscaler if: you're on non-AWS clouds, need simpler configuration, have stable, predictable node requirements, or use GKE Autopilot (built-in scaling).
Auto-scaling failures often stem from configuration mistakes rather than autoscaler bugs. Here are the most common anti-patterns and their solutions:
Anti-Pattern 1: Missing Resource Requests
HPA calculates utilization as current / request. Without requests, utilization is undefined. Pods show 0% or unknown utilization, and HPA cannot scale.
Solution: Always set resource requests. Use VPA in 'Off' mode to get recommendations if unsure.
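For reference, a minimal Deployment with the requests HPA needs; the quantities are placeholders to tune per workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
      - name: app
        image: example.com/web-service:latest
        resources:
          requests:
            cpu: "250m"      # HPA computes utilization against this value
            memory: "256Mi"
          limits:
            memory: "512Mi"
```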
Anti-Pattern 2: maxReplicas Too Low
HPA hits maxReplicas during traffic spike. Even though utilization is high, no more pods are added. Users experience degradation.
Solution: Set maxReplicas with headroom for unexpected spikes. Consider 2-3x peak expected replicas.
Anti-Pattern 3: Slow Pod Startup
HPA adds pods, but they take 60+ seconds to start serving traffic. During this window, existing pods remain overloaded, triggering more scale-up. Result: over-provisioning followed by mass scale-down.
Solution: Optimize container startup time. Use proper readiness probes. Consider pod priority and preemption for faster scheduling.
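As a sketch of the probe side of that fix (paths and port are assumptions), a startupProbe tolerates slow boots without delaying the steady-state readiness signal, so new replicas receive traffic as soon as they are genuinely ready:

```yaml
# Slots into the pod template's containers list
containers:
- name: app
  image: example.com/web-service:latest
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30  # Allow up to 30 x 2s = 60s for slow startup
    periodSeconds: 2
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```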
```bash
#!/bin/bash
# Troubleshooting HPA issues

# Check HPA status and events
kubectl describe hpa <hpa-name> -n <namespace>

# Key things to look for:
# - Current/Desired replicas
# - Current metrics vs targets
# - Conditions (ScalingActive, AbleToScale, ScalingLimited)
# - Events (scaling decisions, errors)

# Check if metrics server is working
kubectl top pods -n <namespace>
kubectl top nodes

# If metrics unavailable, check metrics-server
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server

# Check custom metrics API
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Check external metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .

# Debug Cluster Autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

# Check for pending pods (indicates CA should scale up)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Check node utilization (CA scale-down decisions)
kubectl describe nodes | grep -A5 "Allocated resources"
```

Effective auto-scaling is essential for running efficient, responsive Kubernetes workloads. Let's consolidate the key principles:

- Set resource requests on every container; HPA's utilization math depends on them
- Use autoscaling/v2 behaviors: scale up fast, scale down slowly
- Prefer metrics that reflect real demand (RPS, queue depth, latency) over raw CPU where a custom metrics pipeline exists
- Never let HPA and VPA act on the same resource metrics; split responsibilities or run VPA in Off mode
- Give maxReplicas headroom (2-3x expected peak) and ensure node capacity can follow via Cluster Autoscaler or Karpenter
What's Next:
With auto-scaling in place, your deployments can respond dynamically to load. The next page covers Rolling Updates and Rollbacks—strategies for safely deploying changes to running applications without downtime, and quickly reverting when things go wrong.
You now understand Kubernetes auto-scaling in depth—from HPA and VPA fundamentals through custom metrics, Cluster Autoscaler, Karpenter, and production patterns for combining autoscalers effectively. Apply these principles to build self-adjusting infrastructure that balances cost and performance.