Deploying software updates in production is one of the most critical—and potentially dangerous—operations in system administration. A misconfigured deployment can take down a live system serving millions of users. A slow rollout can leave users on inconsistent versions. A botched rollback can make a bad situation catastrophic.
Kubernetes provides sophisticated deployment machinery that, when properly configured, enables zero-downtime deployments where users experience no interruption during updates. But these tools are only as good as your understanding of them.
This page covers:

- Rolling updates and the maxSurge / maxUnavailable parameters
- Readiness probes and zero-downtime configuration
- Rollbacks and revision history
- Blue-green and canary deployment strategies
- Progressive delivery with Argo Rollouts
- Pod Disruption Budgets and graceful shutdown
By the end of this page, you'll understand how to configure rolling updates that eliminate user-facing downtime, implement instant rollbacks when deployments fail, design canary and blue-green strategies for risk mitigation, and use progressive delivery tools for production-grade release management.
A rolling update incrementally replaces old pods with new pods, ensuring that some instances of the application are always available during the transition. This is Kubernetes' default deployment strategy.
How Rolling Updates Work:

1. Kubernetes creates a new ReplicaSet for the updated pod template
2. The new ReplicaSet is scaled up while the old one is scaled down, within the limits set by maxSurge and maxUnavailable
3. Each new pod must pass its readiness probe before it counts toward availability
4. The cycle repeats until every pod runs the new version

maxSurge and maxUnavailable: The Critical Parameters
These parameters work in tension: higher values mean faster updates but more resource consumption (maxSurge) or more risk (maxUnavailable).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
spec:
  replicas: 10
  # Rolling update strategy configuration
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Maximum pods above replica count during update
      maxSurge: 25%  # Can also be absolute: "2"
      # Maximum pods unavailable during update
      maxUnavailable: 25%  # Can also be absolute: "0" for zero-downtime
  # How many old ReplicaSets to keep for rollback
  revisionHistoryLimit: 10
  # Time to wait before considering pod update failed
  progressDeadlineSeconds: 600  # 10 minutes
  # Minimum time to wait before marking pod Ready
  minReadySeconds: 10
  selector:
    matchLabels:
      app: web-application
  template:
    metadata:
      labels:
        app: web-application
        version: v2.1.0
    spec:
      containers:
      - name: web
        image: myregistry/web-application:v2.1.0
        ports:
        - containerPort: 8080
        # Readiness probe is CRITICAL for rolling updates
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        # Liveness probe detects stuck containers
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 3
```

| Pattern | maxSurge | maxUnavailable | Behavior | Use Case |
|---|---|---|---|---|
| Conservative | 1 | 0 | One new pod at a time, never fewer than desired | Critical services, limited capacity |
| Balanced (default) | 25% | 25% | Update 25% at a time | General production workloads |
| Aggressive | 100% | 0 | Double capacity, then cut old pods | Fast updates with spare capacity |
| Fast but risky | 0 | 50% | Replace half immediately | Dev environments, fast iteration |
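To make the arithmetic behind these parameters concrete, here is a small Python sketch (not Kubernetes code, just an illustration of the documented rounding rules): percentage values of maxSurge round up, and maxUnavailable rounds down.

```python
import math

def resolve_rolling_update(replicas: int, max_surge, max_unavailable):
    """Resolve maxSurge/maxUnavailable into absolute pod counts.

    Percentages are given as strings like "25%"; maxSurge rounds UP,
    maxUnavailable rounds DOWN (Kubernetes' documented behavior).
    """
    def to_absolute(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            fraction = int(value[:-1]) / 100 * replicas
            return math.ceil(fraction) if round_up else math.floor(fraction)
        return int(value)

    surge = to_absolute(max_surge, round_up=True)
    unavailable = to_absolute(max_unavailable, round_up=False)
    return {
        "max_pods_during_update": replicas + surge,
        "min_available_during_update": replicas - unavailable,
    }

# With 10 replicas and the 25%/25% defaults:
limits = resolve_rolling_update(10, "25%", "25%")
print(limits)  # up to 13 pods total, never fewer than 8 available
```

Note the asymmetric rounding: with 10 replicas, "25%" surge becomes 3 extra pods (2.5 rounded up) while "25%" unavailable becomes 2 (2.5 rounded down), so the defaults err slightly toward availability.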
Readiness probes are the most critical component of safe rolling updates. Without proper readiness configuration, Kubernetes cannot determine when new pods are truly ready to serve traffic, leading to:

- Traffic routed to pods that cannot yet serve requests, producing user-facing errors
- Old pods terminated before new pods can actually carry the load
- A rollout that "succeeds" while the application is silently failing
How Readiness Probes Affect Rolling Updates:

- New pods start in a NotReady state until the probe succeeds
- A pod is only added to Service endpoints once it is Ready
- The rollout only proceeds as new pods become Ready, throttled by maxSurge and maxUnavailable

The minReadySeconds Setting:
Even after the readiness probe succeeds, minReadySeconds adds an additional wait period before the pod is considered fully "available". This catches issues that only manifest after a few seconds of traffic.
```yaml
spec:
  minReadySeconds: 30  # Wait 30s after ready before proceeding
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero downtime
  minReadySeconds: 30  # Extra stabilization time
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1.2.3
        # STARTUP PROBE: For slow-starting applications
        # Only checked at startup, replaces liveness until success
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 30  # Allow 2.5 minutes for startup
        # LIVENESS PROBE: Is the container stuck?
        # Failure = container restart
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 0  # Starts after startup probe succeeds
          periodSeconds: 10
          failureThreshold: 3  # Restart after 3 consecutive failures
          timeoutSeconds: 5
        # READINESS PROBE: Can the container serve traffic?
        # Failure = removed from service endpoints
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1  # One success = ready
          failureThreshold: 3  # Three failures = not ready
          timeoutSeconds: 3
        # Resources for predictable startup time
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
```

When a deployment goes wrong, the ability to quickly roll back can mean the difference between a minor incident and a major outage. Kubernetes provides robust rollback capabilities through revision history.
How Revision History Works:
Every time you update a Deployment and it creates a new ReplicaSet, Kubernetes stores that ReplicaSet as a "revision". By default, the last 10 revisions are kept (revisionHistoryLimit: 10).
Rollback is not re-deployment—it's instant:
When you roll back, Kubernetes doesn't rebuild images or re-run pipelines. It simply reactivates an existing ReplicaSet that was already proven to work. This makes rollback extremely fast.
Triggering Rollback:
```bash
#!/bin/bash
# View rollout history
kubectl rollout history deployment/web-application

# Sample output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         kubectl set image deployment/web-application web=myapp:v1.1
# 3         kubectl set image deployment/web-application web=myapp:v1.2
# 4         kubectl set image deployment/web-application web=myapp:v1.3

# View details of a specific revision
kubectl rollout history deployment/web-application --revision=2

# Rollback to the previous revision
kubectl rollout undo deployment/web-application

# Rollback to a specific revision
kubectl rollout undo deployment/web-application --to-revision=2

# Check rollout status
kubectl rollout status deployment/web-application

# Pause a rollout (useful if you spot problems mid-update)
kubectl rollout pause deployment/web-application

# Resume a paused rollout
kubectl rollout resume deployment/web-application

# Restart all pods (triggers rolling restart with same image)
kubectl rollout restart deployment/web-application

# View current ReplicaSets (each is a revision)
kubectl get replicasets -l app=web-application

# Pro tip: Annotate deployments for useful history
kubectl annotate deployment/web-application \
  kubernetes.io/change-cause="Deploy v1.4 - added caching feature"
```

Automatic Rollback with progressDeadlineSeconds:
If a deployment doesn't complete within progressDeadlineSeconds, it's marked as failed. While Kubernetes doesn't automatically roll back, this status enables external tools (Argo CD, Flux) to trigger automatic rollback.
What Constitutes Progress:

- New pods becoming Ready (and staying Ready for minReadySeconds)
- Old pods being scaled down

If neither happens for progressDeadlineSeconds, the deployment is stuck.
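A stuck rollout surfaces in the Deployment's status conditions, which is what external tools watch. A minimal Python sketch of that check (the condition shape mirrors `kubectl get deployment -o json` output):

```python
def is_rollout_stuck(conditions: list[dict]) -> bool:
    """Detect a stuck Deployment: the Progressing condition flips to
    status False with reason ProgressDeadlineExceeded once the
    progressDeadlineSeconds timeout elapses without progress."""
    for cond in conditions:
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False

stuck = is_rollout_stuck([
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "False",
     "reason": "ProgressDeadlineExceeded"},
])
print(stuck)  # True: time to investigate or roll back
```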
Manual vs Automated Rollback:
| Strategy | Trigger | Speed | Use Case |
|---|---|---|---|
| kubectl rollout undo | Manual detection | Instant | Operator-initiated rollback |
| GitOps revert | Git commit revert | Minutes | Infrastructure-as-code workflows |
| Argo Rollouts abort | Metric analysis | Seconds | Automated canary rollback |
| progressDeadlineSeconds | Timeout detection | Minutes | Detect stuck deployments |
Blue-green deployment is a release strategy where you maintain two complete production environments: "blue" (current) and "green" (new). You deploy to green, verify it works, then instantly switch all traffic from blue to green.
Advantages:

- Instant cutover and instant rollback: just flip the Service selector
- The new version can be fully tested in production before receiving any traffic
- Users never see a mixed-version state

Disadvantages:

- Requires roughly double the resources while both environments run
- The switch is all-or-nothing, so every user moves at once
- Database schema changes must stay compatible with both versions
Native Kubernetes Blue-Green:
Kubernetes doesn't have a built-in blue-green resource, but you can implement it using labels and Service selectors:
```yaml
# Blue Deployment (currently serving production traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:v1.0.0
---
# Green Deployment (new version, ready to receive traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 5  # Scaled up, fully ready
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0
---
# Production Service (selector points to current live version)
apiVersion: v1
kind: Service
metadata:
  name: app-production
spec:
  selector:
    app: myapp
    version: blue  # <-- Change to 'green' to switch traffic
  ports:
  - port: 80
    targetPort: 8080
---
# Preview Service (always points to new version for testing)
apiVersion: v1
kind: Service
metadata:
  name: app-preview
spec:
  selector:
    app: myapp
    version: green  # <-- Test new version before switching
  ports:
  - port: 80
    targetPort: 8080
```

Switching Traffic:
```bash
# Switch production traffic from blue to green
kubectl patch service app-production \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Verify the switch
kubectl get service app-production -o jsonpath='{.spec.selector.version}'

# If problems, switch back instantly
kubectl patch service app-production \
  -p '{"spec":{"selector":{"version":"blue"}}}'

# After successful switch, scale down old version
kubectl scale deployment app-blue --replicas=0
```
Blue-Green with Ingress:
For more sophisticated traffic management, use Ingress controllers or service mesh to route traffic between versions based on headers, weights, or other criteria.
The hardest part of blue-green is database schema changes. Both versions must be compatible with the same schema. Use the expand-contract pattern: first deploy schema changes that work with both versions, then deploy new code, then remove old schema elements in a future release.
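The expand-contract idea can be sketched as a compatibility check in Python (the column names and three-phase split here are illustrative assumptions, not a prescribed schema):

```python
# Expand-contract across three releases: each step must keep EVERY
# code version still running compatible with the live schema.
PHASES = [
    # (schema change,                  code versions it supports)
    ("ADD COLUMN email_v2 (nullable)", {"v1", "v2"}),  # expand
    ("deploy v2: writes both columns", {"v1", "v2"}),  # migrate
    ("DROP COLUMN email (old)",        {"v2"}),        # contract
]

def safe_to_apply(phase_index: int, running_versions: set[str]) -> bool:
    """A phase is safe only if every version currently running is in
    the set the schema supports after the change."""
    _, supported = PHASES[phase_index]
    return running_versions <= supported

print(safe_to_apply(0, {"v1", "v2"}))  # True: nullable column is additive
print(safe_to_apply(2, {"v1", "v2"}))  # False: can't drop while v1 runs
```

The contract step only becomes safe once v1 is fully retired, which is why it belongs to a later release rather than the same deployment.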
Canary deployments gradually shift traffic to a new version while monitoring for problems. Unlike blue-green (all-or-nothing switch), canary allows you to catch issues with minimal user impact.
Why "Canary"?
The term comes from coal mining, where canaries were used to detect dangerous gases. If the canary died, miners knew to evacuate. Similarly, a canary deployment exposes a small percentage of traffic to the new version—if it "dies" (experiences errors), you abort before affecting most users.
Canary vs Rolling Updates:
| Rolling Update | Canary |
|---|---|
| Updates pods continuously | Pauses at defined traffic percentages |
| No traffic split control | Precise traffic routing control |
| Harder to abort mid-way | Easy abort by routing all to stable |
| Built into Kubernetes | Requires additional tooling |
Native Kubernetes Canary (Basic):
You can achieve basic canary behavior by running two Deployments with different replica counts behind the same Service:
```yaml
# Stable Deployment: 90% of traffic (9 replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
      - name: app
        image: myapp:v1.0.0
---
# Canary Deployment: 10% of traffic (1 replica)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1  # 1 out of 10 total = ~10% traffic
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0
---
# Service routes to BOTH based on 'app' label
# Traffic distribution follows replica ratio
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # Matches both stable and canary
  ports:
  - port: 80
    targetPort: 8080
```

Limitations of Native Canary:

- The traffic split is tied to the replica ratio, so the smallest adjustment is one whole pod
- The distribution is approximate, because Service load balancing is per-connection rather than weighted
- There is no pause, automated analysis, or automatic rollback
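The granularity limitation is easy to quantify. A small Python sketch of the replica-ratio math:

```python
def canary_traffic_percent(stable_replicas: int, canary_replicas: int) -> float:
    """Approximate canary traffic share when one Service round-robins
    across both Deployments: traffic follows the replica ratio."""
    total = stable_replicas + canary_replicas
    return 100 * canary_replicas / total

print(canary_traffic_percent(9, 1))   # 10.0: the example above
# The granularity problem: the smallest step is one whole pod,
# so ~1% canary traffic requires running ~99 stable replicas.
print(canary_traffic_percent(99, 1))  # 1.0
```

This is why fine-grained canaries (1%, 5%) effectively require traffic-level routing rather than replica-count tricks.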
For production canary deployments, use Argo Rollouts, Flagger, or a service mesh (Istio/Linkerd) for:

- Precise traffic percentages independent of replica counts
- Automated analysis of metrics while the canary runs
- Automatic abort and rollback when the analysis fails
A common canary progression: 1% → 5% → 10% → 25% → 50% → 100%. Start with 1% for critical services to catch catastrophic bugs with minimal impact. Increase faster for well-tested changes in non-critical systems.
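The progression-with-abort logic that progressive delivery tools automate can be sketched in a few lines of Python (the 1% error budget here is an illustrative assumption):

```python
CANARY_STEPS = [1, 5, 10, 25, 50, 100]  # traffic % at each pause
ERROR_BUDGET = 1.0                       # abort above 1% errors (assumption)

def run_progression(observed_error_rates: list[float]) -> tuple[str, int]:
    """Walk the canary steps; at each pause, check the error rate
    observed at that weight. Abort on the first breach, routing all
    traffic back to stable."""
    for weight, error_rate in zip(CANARY_STEPS, observed_error_rates):
        if error_rate > ERROR_BUDGET:
            return ("aborted", weight)   # stable keeps serving everyone
    return ("promoted", 100)

# Healthy canary: errors stay low at every step
print(run_progression([0.1, 0.2, 0.1, 0.3, 0.2, 0.1]))  # ('promoted', 100)
# A bug that only shows under load breaches at the 25% step
print(run_progression([0.1, 0.2, 0.1, 4.0, 0.2, 0.1]))  # ('aborted', 25)
```

Note how the second run catches the failure while only 25% of users were exposed, which is exactly the risk-containment argument for pausing at intermediate weights.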
Argo Rollouts is a Kubernetes controller and set of CRDs that provide advanced deployment capabilities including blue-green, canary, and experimentation features that go far beyond native Kubernetes.
Key Features:

- Blue-green and canary strategies as first-class fields on a Rollout resource
- Fine-grained traffic shaping through Ingress controllers and service meshes
- Automated analysis against metric providers such as Prometheus, with automatic rollback on failure
- Manual promotion gates, a kubectl plugin, and a web dashboard
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 10
  # Selector and template work like Deployment
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
  # Canary strategy with automated analysis
  strategy:
    canary:
      # Traffic routing via Nginx Ingress
      canaryService: myapp-canary
      stableService: myapp-stable
      # Traffic management
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress
      # Canary progression steps
      steps:
      # Step 1: 5% traffic for 5 minutes
      - setWeight: 5
      - pause: {duration: 5m}
      # Step 2: Run automated analysis
      - analysis:
          templates:
          - templateName: success-rate-analysis
          args:
          - name: service-name
            value: myapp-canary
      # Step 3: If analysis passes, 25% for 10 minutes
      - setWeight: 25
      - pause: {duration: 10m}
      # Step 4: More analysis
      - analysis:
          templates:
          - templateName: success-rate-analysis
          args:
          - name: service-name
            value: myapp-canary
      # Step 5: 50% traffic
      - setWeight: 50
      - pause: {duration: 10m}
      # Step 6: Final analysis before full rollout
      - analysis:
          templates:
          - templateName: success-rate-analysis
          args:
          - name: service-name
            value: myapp-canary
  # How many analysis runs to keep in history
  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3
---
# AnalysisTemplate: Defines what metrics to check
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-analysis
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    # Query Prometheus for success rate
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{
            service="{{args.service-name}}",
            status=~"2.."
          }[5m]))
          /
          sum(rate(http_requests_total{
            service="{{args.service-name}}"
          }[5m])) * 100
    # Success criteria
    successCondition: result[0] >= 99  # 99% success rate required
    failureCondition: result[0] < 95   # Below 95% = immediate failure
    interval: 1m
    count: 5
```

Useful commands from the Argo Rollouts kubectl plugin:

- `kubectl argo rollouts get rollout myapp` - View rollout status and steps
- `kubectl argo rollouts promote myapp` - Manually promote to next step
- `kubectl argo rollouts abort myapp` - Abort rollout and revert to stable
- `kubectl argo rollouts retry rollout myapp` - Retry failed rollout
- `kubectl argo rollouts set image myapp app=myapp:v3.0` - Trigger new rollout
- `kubectl argo rollouts dashboard` - Open web dashboard (port 3100)

Pod Disruption Budgets (PDBs) limit the number of pods that can be voluntarily disrupted at once. They protect workloads during cluster autoscaler drains and admin maintenance operations. (Note that a Deployment's own rolling update is governed by its maxUnavailable setting, not by the PDB.)
What Counts as Voluntary Disruption:

- kubectl drain for node maintenance
- Cluster autoscaler removing underutilized nodes
- Evictions requested through the Eviction API

What Doesn't Count (Involuntary):

- Node hardware failure or kernel panic
- Cloud provider deleting the VM
- The kernel killing a container that exceeds its memory limit
PDB Configuration:
```yaml
# Option 1: Minimum available pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  # At least 3 pods must always be available
  minAvailable: 3
  # Which pods this PDB protects
  selector:
    matchLabels:
      app: myapp
---
# Option 2: Maximum unavailable pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb-percentage
  namespace: production
spec:
  # At most 20% of pods can be unavailable
  maxUnavailable: 20%
  selector:
    matchLabels:
      app: myapp
---
# Option 3: For single-instance stateful workloads
# "Unhealthy pod eviction" allows evicting stuck pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  maxUnavailable: 0  # Never disrupt voluntarily
  selector:
    matchLabels:
      app: postgres-primary
  # As of K8s 1.27, can configure unhealthy pod eviction
  unhealthyPodEvictionPolicy: AlwaysAllow  # or IfHealthyBudget
```

| Workload Type | Recommended PDB | Rationale |
|---|---|---|
| Stateless, 3+ replicas | maxUnavailable: 1 or 25% | Allows gradual updates while maintaining quorum |
| Database primary | maxUnavailable: 0 | Never voluntarily disrupt the primary |
| Database replicas | maxUnavailable: 1 | Maintain read capacity during maintenance |
| Kafka brokers | minAvailable: N-1 (where N = replication factor) | Maintain quorum for partitions |
| Stateful with 2 replicas | maxUnavailable: 1 | Allow rolling updates one at a time |
For truly zero-downtime deployments, pods must handle termination gracefully—finishing in-flight requests before exiting. Kubernetes provides a termination lifecycle, but your application must cooperate.
Pod Termination Sequence:

1. The pod is marked Terminating and the grace-period timer starts
2. If a preStop hook is defined, it runs to completion (or until the grace period expires)
3. In parallel: the kubelet sends SIGTERM to the containers, and the endpoints controller removes the pod from Service endpoints
4. After terminationGracePeriodSeconds, SIGKILL is sent (force kill)
Step 3's operations happen in parallel but aren't atomic. The pod might receive new requests after SIGTERM was sent but before endpoints update propagates to all load balancers. This causes errors for users connecting to a dying pod.
The PreStop Hook Solution:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  template:
    spec:
      # Give pods plenty of time to drain
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: myapp:latest
        lifecycle:
          preStop:
            exec:
              # Wait for endpoints to update before stopping
              command:
              - /bin/sh
              - -c
              - |
                echo "Starting graceful shutdown..."
                sleep 15  # Wait for LB to stop sending traffic
                echo "Draining connections..."
                # Or call application-specific drain endpoint
                # curl -X POST localhost:8080/admin/drain
        # Application must handle SIGTERM
        # Most frameworks do this automatically:
        # - Node.js: process.on('SIGTERM', handler)
        # - Java: Runtime.addShutdownHook()
        # - Go: signal.Notify()
      # For sidecars that should stop last
      - name: sidecar
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 20"]
```

Complete Graceful Shutdown Checklist:
- **Set adequate terminationGracePeriodSeconds:** Default is 30s, but long-running requests may need 60-120s
- **Add a preStop hook with sleep:** 10-15 seconds allows endpoints to update and LBs to drain
- **Handle SIGTERM in the application:** Stop accepting new connections, finish in-flight requests
- **Return 5xx from health checks during shutdown:** Some apps mark themselves unhealthy to speed up removal from load balancers
- **Set proper timeouts:** Connection read/write timeouts should be less than terminationGracePeriodSeconds
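The SIGTERM-handling item can be sketched in Python; this is a minimal illustration, not a production server, and the readiness flag and drain timeout are assumptions about how your app exposes health:

```python
import signal
import time

class GracefulServer:
    """Minimal sketch of graceful shutdown: on SIGTERM, flip the
    readiness flag (so health checks fail and the pod leaves the
    endpoints) and drain in-flight requests before exiting."""

    def __init__(self):
        self.ready = True   # what a /health/ready endpoint would report
        self.in_flight = 0  # incremented/decremented per request
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.ready = False  # readiness checks now fail: stop new traffic

    def drain(self, timeout: float = 30.0) -> bool:
        """Wait for in-flight requests to finish, up to `timeout`
        (keep this below terminationGracePeriodSeconds)."""
        deadline = time.monotonic() + timeout
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(0.05)
        return self.in_flight == 0

server = GracefulServer()
# Simulate Kubernetes sending SIGTERM to this process:
signal.raise_signal(signal.SIGTERM)
print(server.ready)    # False: pod will be removed from endpoints
print(server.drain())  # True: no requests were in flight
```

The key ordering matters: fail readiness first, keep serving in-flight work, and only exit once the drain completes or the grace period is about to run out.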
Cloud Load Balancer Considerations:
Cloud load balancers (ALB, NLB, GCP LB) have their own health check and deregistration delays. Configure:
- `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=30`

Test graceful shutdown under load! Many issues only appear under real traffic. Use tools like wrk or k6 to generate load, then trigger a rolling update and watch for error spikes.
Safe, reliable deployments are the result of careful configuration and thorough testing. Let's consolidate the key principles:

- Always define readiness probes; rolling updates are only as safe as your probes
- Use maxUnavailable: 0 (with maxSurge of at least 1) when zero downtime is required
- Keep revision history and annotate change causes so rollback is instant and informed
- Use canary or blue-green strategies, ideally with automated analysis, for high-risk releases
- Protect capacity during maintenance with Pod Disruption Budgets
- Implement graceful shutdown (preStop hooks, SIGTERM handling) so pod replacement never drops requests
What's Next:
With deployment strategies mastered, the next page covers Monitoring and Logging—essential observability practices that let you understand what's happening in your cluster, detect problems before they become outages, and debug issues when they occur.
You now have comprehensive knowledge of Kubernetes deployment strategies—from rolling updates and rollbacks through blue-green, canary, and progressive delivery with Argo Rollouts. Apply these patterns to achieve reliable, zero-downtime releases in production.