Not every application fits the stateless mold. Databases, message brokers, distributed caches, and consensus systems all share a common requirement: they need to be treated as individuals, not as interchangeable units in a pool.
Consider what happens when you run a PostgreSQL primary-replica setup. The primary instance must know it's the primary. Replicas need stable addresses to connect to the primary. Each instance requires its own persistent storage that survives pod restarts. Replacing a failed replica isn't as simple as spinning up a new, anonymous pod—it needs the same identity, the same network address, and access to the same data.
This is where StatefulSets enter the picture. They provide the guarantees that stateful applications demand while still leveraging Kubernetes' orchestration capabilities.
By the end of this page, you'll understand when and why to use StatefulSets instead of Deployments. You'll master the concepts of stable pod identities, headless services, persistent volume claims, and ordered operations. You'll also learn the operational patterns for running production databases and distributed systems on Kubernetes.
To understand why StatefulSets exist, we must first understand what Deployments cannot provide—even with persistent volumes attached.
Deployment limitations for stateful workloads:
Deployments treat all pods as identical and interchangeable. This creates several problems for stateful applications:
| Requirement | Deployment Behavior | Why It's a Problem |
|---|---|---|
| Stable network identity | Pod names are random (e.g., api-7d8cf4d9-xz4rw) | Other pods can't reliably connect; DNS names change on restart |
| Persistent storage per pod | All pods can mount same volume or get random PVCs | Data isn't associated with specific pod identity |
| Ordered startup | All pods start simultaneously | Leader election, data initialization may conflict |
| Ordered shutdown | Pods terminate in arbitrary order | Graceful cluster shutdown requires reverse order |
| Ordered updates | Pods update with configurable parallelism | Rolling updates may disrupt quorum in distributed systems |
Real-world example: A Kafka cluster gone wrong
Imagine running a 3-broker Kafka cluster using a Deployment:
- The brokers come up with random pod names such as kafka-abc123, kafka-def456, and kafka-ghi789.
- When kafka-abc123 fails, its replacement appears as kafka-jkl012, with a new name, a new DNS entry, and no guaranteed claim to the old broker's storage.
- Other brokers and clients still pointing at kafka-abc123 hit DNS resolution failures, and the cluster treats the replacement as an entirely new broker.

With a StatefulSet, the replacement pod would retain the original name kafka-0, kafka-1, or kafka-2, reconnect to its original persistent storage, and resume its role seamlessly.
A common misconception is that StatefulSets are simply "Deployments with persistent volumes." This misses the point. The key differentiator is stable identity—the persistent, predictable naming that allows distributed systems to maintain their coordination state across pod restarts and rescheduling.
StatefulSets provide specific guarantees that collectively enable reliable operation of stateful applications. Understanding these guarantees—and their limitations—is crucial for correct usage.
- Stable, unique pod names: Pods are named $(statefulset-name)-$(ordinal). Pod mysql-0 will always be mysql-0, even if it's rescheduled to a different node.
- Stable per-pod storage: Each pod gets its own PersistentVolumeClaim. When mysql-0 restarts, it reconnects to the same persistent volume.
- Ordered deployment, scaling, and updates: Pods are created sequentially (0, 1, 2, ...) and terminated in reverse order, one at a time by default.

Headless Services: The Key to Stable DNS
StatefulSets rely on a headless service (a service with clusterIP: None) to provide stable DNS names for each pod. Unlike regular services that load-balance across pods, headless services return the individual pod IPs, enabling direct addressing.
The DNS naming pattern for StatefulSet pods:
$(pod-name).$(headless-service-name).$(namespace).svc.cluster.local
For a StatefulSet named mysql with headless service mysql-headless in the databases namespace:
- mysql-0.mysql-headless.databases.svc.cluster.local
- mysql-1.mysql-headless.databases.svc.cluster.local
- mysql-2.mysql-headless.databases.svc.cluster.local

Let's examine a production-grade StatefulSet configuration for a distributed database. We'll break down each section to understand its purpose:
```yaml
# Headless Service - Required for stable network identity
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
  namespace: databases
  labels:
    app: mysql
spec:
  clusterIP: None  # This makes it a headless service
  ports:
  - port: 3306
    name: mysql
  - port: 33060
    name: mysqlx
  selector:
    app: mysql
---
# Regular Service - For client access with load balancing
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: databases
spec:
  ports:
  - port: 3306
    name: mysql
  selector:
    app: mysql
    role: primary  # Only routes to primary
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: databases
spec:
  serviceName: mysql-headless  # Links to headless service
  replicas: 3

  # === Pod Management Policy ===
  podManagementPolicy: OrderedReady  # Default: sequential startup
  # Alternative: Parallel - for faster scaling when order doesn't matter

  # === Update Strategy ===
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Update all pods (set higher to canary)

  # === Selector ===
  selector:
    matchLabels:
      app: mysql

  # === Pod Template ===
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        prometheus.io/scrape: "true"
    spec:
      terminationGracePeriodSeconds: 120  # Give MySQL time to flush

      # === Init Container: Configure based on ordinal ===
      initContainers:
      - name: init-mysql
        image: mysql:8.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Extract ordinal from hostname (mysql-0 -> 0)
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          # Generate server-id from ordinal
          echo "[mysqld]" > /mnt/conf/server-id.cnf
          echo "server-id=$((100 + ordinal))" >> /mnt/conf/server-id.cnf
          # First pod is primary, others are replicas
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/primary.cnf /mnt/conf/
          else
            cp /mnt/config-map/replica.cnf /mnt/conf/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf
        - name: config-map
          mountPath: /mnt/config-map

      # === Main Container ===
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
          name: mysql
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secrets
              key: root-password
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - bash
            - "-c"
            - |
              mysql -h 127.0.0.1 -uroot -p$MYSQL_ROOT_PASSWORD -e "SELECT 1"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 2
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
        - name: conf
          mountPath: /etc/mysql/conf.d

      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql-config

  # === Persistent Volume Claim Templates ===
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: premium-ssd
      resources:
        requests:
          storage: 100Gi
```

Key points in this configuration:

- podManagementPolicy: OrderedReady (default) ensures sequential operations. Use Parallel only when pods don't depend on each other's state.
- volumeClaimTemplates create one PVC per pod, named $(volumeClaimTemplate.name)-$(pod-name), e.g., data-mysql-0.

StatefulSets have a unique relationship with storage that differs fundamentally from Deployments. Understanding this relationship is critical for data safety and operational planning.
The PVC Retention Behavior:
This is one of the most important yet misunderstood aspects of StatefulSets:
PVCs are NOT deleted when pods are scaled down — This prevents accidental data loss but also means you must explicitly clean up unused PVCs.
PVCs are NOT deleted when the StatefulSet is deleted — Even kubectl delete sts mysql leaves all PVCs intact. This is a safety feature.
Scaling back up reuses existing PVCs — If you scale down from 3 to 2, then back to 3, the new mysql-2 pod gets the original data-mysql-2 PVC with all its data.
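A minimal sketch of this lifecycle, using the mysql example above (PVC names follow the $(volumeClaimTemplate.name)-$(pod-name) pattern, so the third pod's claim is data-mysql-2):

```bash
# Scale down from 3 replicas to 2 -- mysql-2 is removed, but its PVC is not
kubectl scale statefulset mysql -n databases --replicas=2
kubectl get pvc data-mysql-2 -n databases   # still shows STATUS: Bound

# Option A: scale back up -- the new mysql-2 reattaches to data-mysql-2 and its data
kubectl scale statefulset mysql -n databases --replicas=3

# Option B: the scale-down is permanent -- reclaim storage by deleting the PVC manually
# (verify backups first; this is irreversible once the underlying PV is released)
kubectl delete pvc data-mysql-2 -n databases
```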
After scaling down a StatefulSet, you must manually delete unused PVCs if you want to reclaim storage. Kubernetes 1.27+ enables the persistentVolumeClaimRetentionPolicy field (beta, on by default) to automate this, but both defaults remain Retain for safety. Always verify data is backed up before deleting PVCs.
```yaml
# Kubernetes 1.27+ PVC Retention Policy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  # ...
  persistentVolumeClaimRetentionPolicy:
    # What happens when the StatefulSet is deleted
    whenDeleted: Retain  # Default: keep PVCs
    # Alternative: Delete - delete PVCs with the StatefulSet
    # What happens when the replica count is reduced
    whenScaled: Retain   # Default: keep PVCs of scaled-down pods
    # Alternative: Delete - delete PVCs immediately on scale-down
```

Recommended storage settings for StatefulSet volumes:

| Setting | Recommended Value | Notes |
|---|---|---|
| Access Mode | ReadWriteOnce (RWO) | Standard for single-pod attachment |
| Volume Binding | WaitForFirstConsumer | Ensures PV is provisioned in pod's zone |
| Expand | allowVolumeExpansion: true | Enables online storage growth |
| Reclaim Policy | Delete or Retain | Retain for production, Delete for dev |
| IOPS/Throughput | Match workload needs | Premium SSD for databases |
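To make these recommendations concrete, here is a sketch of a StorageClass that the mysql example's storageClassName: premium-ssd could point to. The provisioner and parameters below assume the AWS EBS CSI driver purely as an example; substitute your cluster's CSI driver and its documented parameters.

```yaml
# Hypothetical StorageClass matching the table above (provisioner/parameters are assumptions)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd                        # Referenced by the volumeClaimTemplates
provisioner: ebs.csi.aws.com               # Assumption: AWS EBS CSI driver
parameters:
  type: gp3                                # Assumption: SSD-backed volume type
volumeBindingMode: WaitForFirstConsumer    # Provision the PV in the pod's scheduled zone
allowVolumeExpansion: true                 # Allow online storage growth
reclaimPolicy: Retain                      # Keep the PV (and data) if the PVC is deleted
```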
The ordered nature of StatefulSet operations is crucial for distributed systems that require coordination. Let's examine how ordering works and when to modify it.
Understanding Update Partitions:
The partition field in the update strategy is a powerful tool for staged rollouts. When partition is set to N, pods with ordinal >= N are updated, while pods with ordinal < N remain on the old version.
```bash
# Scenario: 5-node Elasticsearch cluster (es-0 through es-4)
# Goal: Canary update on last 2 nodes first

# Step 1: Set partition to 3 (pods 3,4 will update; pods 0,1,2 won't)
kubectl patch statefulset es --type='json' -p='[
  {"op": "replace", "path": "/spec/updateStrategy/rollingUpdate/partition", "value": 3}
]'

# Step 2: Update the image (only es-3 and es-4 update)
kubectl set image statefulset/es elasticsearch=elasticsearch:8.11.1

# Step 3: Validate cluster health
kubectl exec es-0 -- curl -s localhost:9200/_cluster/health

# Step 4: Lower partition to update more nodes
kubectl patch statefulset es --type='json' -p='[
  {"op": "replace", "path": "/spec/updateStrategy/rollingUpdate/partition", "value": 0}
]'
# Now all nodes update in order: es-2, es-1, es-0
```

For production distributed systems, use partitions to implement a staged rollout: (1) Update the highest ordinal pod first, (2) Verify cluster health, (3) Gradually lower the partition, (4) Monitor cluster behavior at each stage. This gives you multiple opportunities to catch issues before affecting your primary/leader nodes (typically the lowest ordinals).
Running stateful applications in production requires patterns beyond basic StatefulSet configuration. Here are battle-tested approaches for common scenarios:
Leader Election with Pod Ordinal
Many applications use the StatefulSet ordinal for leader election:
The init container below extracts the ordinal from the pod's hostname and writes the resulting role to a shared config volume:
```yaml
initContainers:
- name: configure-role
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    ORDINAL=$(echo $HOSTNAME | grep -o '[0-9]*$')
    if [ "$ORDINAL" = "0" ]; then
      echo "primary" > /config/role
      echo "ROLE=primary" >> /config/env
    else
      echo "replica" > /config/role
      echo "ROLE=replica" >> /config/env
      echo "PRIMARY_HOST=db-0.db-headless" >> /config/env
    fi
  volumeMounts:
  - name: config
    mountPath: /config
```

For production databases, consider using Kubernetes Operators instead of manually managing StatefulSets. Operators like PostgreSQL Operator (Zalando/CrunchyData), MySQL Operator (Oracle/Percona), or Strimzi (Kafka) encode operational knowledge—backup, failover, scaling, upgrades—into automated controllers that reduce human error and operational burden.
StatefulSet issues often involve storage, ordering, or identity—dimensions that don't exist with Deployments. Here's a systematic debugging approach:
```bash
# 1. Check StatefulSet status and conditions
kubectl describe statefulset mysql
# Look for:
# - CurrentReplicas vs ReadyReplicas
# - UpdateRevision vs CurrentRevision (update progress)
# - Events indicating failures

# 2. Check pod ordering and state
kubectl get pods -l app=mysql -o wide
# Pods should be sequentially numbered
# Earlier pods should be Ready before later ones start

# 3. Verify PVC bindings
kubectl get pvc -l app=mysql
# Each PVC should be Bound to a PV
# PVC names should match pattern: data-mysql-N

# 4. Check PV provisioning
kubectl describe pvc data-mysql-0
# Look for: Events, StorageClass, VolumeMode

# 5. Debug ordering issues
kubectl get pods -l app=mysql -w
# Watch pods during scale up/down
# If pod-N starts before pod-N-1 is Ready, there's an issue

# 6. Verify headless service DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup mysql-headless
# Should return A records for each pod
```

Common symptoms and their fixes:

| Symptom | Likely Cause | Solution |
|---|---|---|
| Pods stuck in Pending | PVC cannot be provisioned (no PV, wrong zone) | Check StorageClass, CSI driver logs, PVC events |
| Scale-up blocked at pod N | Pod N-1 not Ready | Debug pod N-1's readiness probe, logs |
| DNS resolution fails for pod | Headless service misconfigured | Verify serviceName matches headless service name |
| Pod restarts with fresh data | PVC not correctly bound | Check volumeMounts, PVC name pattern |
| Update stuck on pod N | Partition set too high | Lower updateStrategy.partition value |
| Old pods not deleting during update | Pods repeatedly failing readiness | Check logs, fix application issues |
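For identity- or storage-specific symptoms in the table above, a quick per-pod check like this sketch (reusing the mysql example and databases namespace) confirms whether the stable DNS record and PVC binding are intact:

```bash
# Resolve a single pod's stable DNS name
# (pattern: pod-name.headless-service.namespace.svc.cluster.local)
kubectl run -it --rm dns-check --image=busybox --restart=Never -- \
  nslookup mysql-0.mysql-headless.databases.svc.cluster.local

# Confirm the pod's PVC follows the expected name pattern and is Bound
kubectl get pvc data-mysql-0 -n databases

# Check which claim the pod actually mounts
kubectl describe pod mysql-0 -n databases | grep ClaimName
```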
Let's consolidate the essential knowledge about StatefulSets:
- Stable identity is the core guarantee: each pod keeps a predictable name ($(statefulset-name)-$(ordinal)) and its own PVC across restarts and rescheduling.
- A headless service gives every pod a stable DNS name of the form pod-name.service-name.namespace.svc.cluster.local.
- Operations are ordered by default (OrderedReady); switch to Parallel when order doesn't matter.
- PVCs are retained on scale-down and StatefulSet deletion unless a persistentVolumeClaimRetentionPolicy says otherwise; clean up unused PVCs deliberately.
- For staged rollouts, set updateStrategy.partition to N to only update pods with ordinal >= N.

What's next:
Now that you understand StatefulSets for applications requiring stable identity and storage, we'll explore DaemonSets—the workload type for running exactly one pod on every (or selected) node. DaemonSets are essential for node-level concerns like logging agents, monitoring exporters, and network plugins.
You now have a comprehensive understanding of StatefulSets. You can configure stateful applications with stable identities, persistent storage, and ordered operations. You understand the storage lifecycle, partition-based updates, and operational patterns for distributed systems. Next, we'll explore DaemonSets for node-level workloads.