While Deployments excel at managing stateless applications, they lack the guarantees required by databases, message queues, and other stateful systems. These applications need stable network identities that survive rescheduling, dedicated persistent storage that follows each replica, and ordered, graceful deployment and scaling.
StatefulSets provide these guarantees by combining stable pod identities with persistent volume management. Unlike Deployments where pods are interchangeable, StatefulSet pods have predictable names (pod-0, pod-1, pod-2) and each gets a dedicated PVC that follows it through restarts and reschedules.
This page explores how StatefulSets manage storage—the volumeClaimTemplates mechanism, storage identity guarantees, scaling behaviors, and patterns for running production databases on Kubernetes.
By the end of this page, you will understand volumeClaimTemplates, stable storage identity, the relationship between pod and PVC lifecycles, scaling behaviors for storage, PVC retention policies introduced in Kubernetes 1.27+, and production patterns for stateful applications.
StatefulSets use a fundamentally different storage model than Deployments. Understanding this model is crucial for designing reliable stateful applications.
The core difference:
Deployments: All pods share PVCs defined in the pod spec. Pods are fungible—any can attach to any available PVC (if using RWX) or they compete for RWO volumes.
StatefulSets: Each pod gets its own PVC, automatically created from a template. The PVC name embeds the pod name, and therefore its ordinal, creating a permanent bond between a specific pod identity and a specific volume.
The identity binding:
| Component | Pattern | Example | Persists Across |
|---|---|---|---|
| Pod Name | <statefulset>-<ordinal> | mysql-0, mysql-1, mysql-2 | Reschedules, restarts |
| PVC Name | <volumeClaimTemplate>-<statefulset>-<ordinal> | data-mysql-0, data-mysql-1 | Pod deletion, scale down |
| DNS Name | <pod>.<service>.<namespace>.svc.cluster.local | mysql-0.mysql-headless.db.svc.cluster.local | Reschedules, restarts |
| Ordinal Index | 0, 1, 2, ... (0-indexed) | 0 is primary, 1+ are replicas | Pod lifetime |
The storage guarantee:
When mysql-0 is deleted and recreated (due to node failure, manual deletion, or rolling update), the replacement pod is bound to the same claim, data-mysql-0, and starts with exactly the data its predecessor wrote. This guarantee is what makes StatefulSets suitable for databases: the primary (mysql-0) always gets the primary's data, regardless of which physical node runs it.
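You can observe this binding directly. The following sketch (assuming a StatefulSet named mysql with a volumeClaimTemplate named data) shows that a recreated pod re-attaches the same claim rather than getting a new one:

```bash
# Show which PVC the pod mounts - the claim name embeds the pod's ordinal
kubectl get pod mysql-0 \
  -o jsonpath='{.spec.volumes[?(@.name=="data")].persistentVolumeClaim.claimName}'
# data-mysql-0

# Delete the pod; the StatefulSet controller recreates it with the same name
kubectl delete pod mysql-0
kubectl wait --for=condition=Ready pod/mysql-0 --timeout=5m

# The PVC's AGE predates the new pod - it was never recreated, only re-attached
kubectl get pvc data-mysql-0
```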
By default, StatefulSet PVCs are NOT deleted when pods are deleted or the StatefulSet is scaled down. This is intentional—data preservation is the priority. Manual cleanup or PVC retention policies (Kubernetes 1.27+) are required to remove PVCs.
The volumeClaimTemplates field is the mechanism by which StatefulSets create PVCs. It's a list of PVC specifications that act as templates—for each pod, Kubernetes creates one PVC per template.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: database
spec:
  serviceName: postgresql-headless  # Required: headless service for DNS
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  # Pod template - standard pod spec
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:15
        ports:
        - containerPort: 5432
          name: postgres
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        # Mount the data volume (from volumeClaimTemplate)
        - name: data
          mountPath: /var/lib/postgresql/data
        # Mount WAL volume for separate WAL storage
        - name: wal
          mountPath: /var/lib/postgresql/wal
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2"
      # Init container for permissions
      initContainers:
      - name: init-permissions
        image: busybox
        command: ['sh', '-c', 'chown -R 999:999 /var/lib/postgresql/data /var/lib/postgresql/wal']
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: wal
          mountPath: /var/lib/postgresql/wal
  # VolumeClaimTemplates - PVC templates for each pod
  volumeClaimTemplates:
  # Primary data volume - high-performance SSD
  - metadata:
      name: data
      labels:
        app: postgresql
        component: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
  # WAL volume - separate disk for write-ahead logs
  # Improves performance by isolating sequential WAL writes
  - metadata:
      name: wal
      labels:
        app: postgresql
        component: wal
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 20Gi
---
# Headless service for stable DNS
apiVersion: v1
kind: Service
metadata:
  name: postgresql-headless
  namespace: database
spec:
  clusterIP: None  # Headless - no load balancing
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
    name: postgres
```

PVC creation mechanics:
When the above StatefulSet is created with 3 replicas, Kubernetes creates six PVCs: data-postgresql-0, data-postgresql-1, and data-postgresql-2 from the first template, plus wal-postgresql-0, wal-postgresql-1, and wal-postgresql-2 from the second.
Each pod gets exactly one PVC per template, named <template-name>-<statefulset-name>-<ordinal>.
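For example, with the data and wal templates above and 3 replicas, listing claims in the namespace shows two PVCs per ordinal (output is illustrative):

```bash
kubectl get pvc -n database -l app=postgresql
# NAME                STATUS   CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgresql-0   Bound    100Gi      RWO            fast-ssd
# data-postgresql-1   Bound    100Gi      RWO            fast-ssd
# data-postgresql-2   Bound    100Gi      RWO            fast-ssd
# wal-postgresql-0    Bound    20Gi       RWO            fast-ssd
# wal-postgresql-1    Bound    20Gi       RWO            fast-ssd
# wal-postgresql-2    Bound    20Gi       RWO            fast-ssd
```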
VolumeClaimTemplates cannot be modified after StatefulSet creation. Changing storage size or class requires creating a new StatefulSet or using manual PVC resizing. Plan storage requirements carefully before deployment.
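As a sketch of the manual resizing path mentioned above (assuming the StorageClass sets allowVolumeExpansion: true), you can grow the existing PVCs directly and, if future pods should also get the larger size, recreate the StatefulSet object without touching its pods:

```bash
# Grow each existing claim - requires allowVolumeExpansion: true on the StorageClass
for i in 0 1 2; do
  kubectl -n database patch pvc data-postgresql-$i \
    --type merge -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
done

# Some CSI drivers finish the filesystem resize only after a pod restart;
# check for the FileSystemResizePending condition
kubectl -n database describe pvc data-postgresql-0

# To change the template itself, delete the StatefulSet object but keep its pods,
# then re-apply a manifest with the updated volumeClaimTemplates
kubectl -n database delete statefulset postgresql --cascade=orphan
kubectl apply -f postgresql-statefulset.yaml   # hypothetical manifest with the 200Gi template
```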
StatefulSets provide ordered, graceful deployment that's critical for distributed systems where startup order matters (e.g., database clusters where primaries must initialize before replicas).
Default ordered behavior (OrderedReady): pods are created strictly in ordinal order (0, 1, 2, ...). Each pod's PVCs must bind and the pod must pass its readiness probe before the next pod is created; termination and scale-down proceed in reverse order.
The startup sequence:
```
# StatefulSet with 3 replicas - startup sequence

Time   Action                                  Status
────────────────────────────────────────────────────────────────────
T+0    Create StatefulSet
T+1    Create PVC data-app-0                   Pending (waiting for binding)
T+2    PVC data-app-0 bound                    Bound
T+3    Create Pod app-0                        Pending (waiting for scheduling)
T+4    Pod app-0 scheduled, containers starting
T+5    Pod app-0 Running                       Running (not yet Ready)
T+6    Pod app-0 passes readiness probe        Running, Ready ✓
       ──── pod-0 Ready, proceed to pod-1 ────
T+7    Create PVC data-app-1                   Pending → Bound
T+8    Create Pod app-1                        Pending → Running
T+9    Pod app-1 Ready                         Running, Ready ✓
       ──── pod-1 Ready, proceed to pod-2 ────
T+10   Create PVC data-app-2                   Pending → Bound
T+11   Create Pod app-2                        Pending → Running
T+12   Pod app-2 Ready                         Running, Ready ✓
       ──── All replicas ready ────
```

Parallel pod management:
For workloads that don't require strict ordering (e.g., sharded databases where each shard is independent), you can use parallel pod management:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sharded-cache
spec:
  serviceName: sharded-cache
  replicas: 10
  # Parallel pod management - all pods start simultaneously
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: sharded-cache
  template:
    metadata:
      labels:
        app: sharded-cache
    spec:
      containers:
      - name: cache
        image: redis:7
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 10Gi
```

| Policy | Creation Order | Update Order | Use Case |
|---|---|---|---|
| OrderedReady | Sequential (0→N) | Reverse (N→0) | Primary-replica databases, consensus systems |
| Parallel | Simultaneous | Reverse (N→0); the policy does not affect updates | Sharded systems, stateless-like workloads with stable IDs |
Understanding how StatefulSets handle scaling is critical for capacity planning and cost management. The behavior differs significantly from Deployments.
Scale up behavior: new pods are created at the next ordinals, each with freshly provisioned PVCs from the volumeClaimTemplates (one at a time under OrderedReady).

Scale down behavior (critical to understand): pods are removed in reverse ordinal order, but by default their PVCs are left behind, still bound and still consuming storage, as the example below shows.
```bash
# Scale up from 3 to 5 replicas
kubectl scale statefulset mysql --replicas=5

# Result:
# Pods: mysql-0, mysql-1, mysql-2, mysql-3, mysql-4
# PVCs: data-mysql-0, data-mysql-1, data-mysql-2, data-mysql-3, data-mysql-4

# Scale down from 5 to 2 replicas
kubectl scale statefulset mysql --replicas=2

# Result:
# Pods: mysql-0, mysql-1 (mysql-2,3,4 deleted)
# PVCs: data-mysql-0, data-mysql-1, data-mysql-2, data-mysql-3, data-mysql-4
#       ↑ ALL PVCs still exist!

# The orphaned PVCs still consume storage
kubectl get pvc -l app=mysql
# NAME           STATUS   VOLUME   CAPACITY   ACCESS MODES
# data-mysql-0   Bound    pv-xxx   100Gi      RWO
# data-mysql-1   Bound    pv-yyy   100Gi      RWO
# data-mysql-2   Bound    pv-zzz   100Gi      RWO   ← Orphaned!
# data-mysql-3   Bound    pv-aaa   100Gi      RWO   ← Orphaned!
# data-mysql-4   Bound    pv-bbb   100Gi      RWO   ← Orphaned!

# Manual cleanup when data is no longer needed
kubectl delete pvc data-mysql-2 data-mysql-3 data-mysql-4
```

Orphaned PVCs from scale-down operations continue incurring cloud storage costs. Implement monitoring for orphaned PVCs and regular cleanup procedures. Some organizations use controllers to automatically notify on or delete stale PVCs.
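One way to detect orphans is to compare each claim's ordinal against the StatefulSet's current replica count. A minimal sketch, assuming the default <template>-<statefulset>-<ordinal> naming (the sts, ns, and template values are placeholders to adjust):

```bash
#!/usr/bin/env bash
# Flag PVCs whose ordinal is >= the StatefulSet's current replica count.
sts=mysql        # StatefulSet name
ns=default       # namespace
template=data    # volumeClaimTemplate name

replicas=$(kubectl -n "$ns" get statefulset "$sts" -o jsonpath='{.spec.replicas}')

kubectl -n "$ns" get pvc -o name |
  grep "^persistentvolumeclaim/${template}-${sts}-" |
  while read -r pvc; do
    ordinal=${pvc##*-}                    # trailing ordinal from the claim name
    if [ "$ordinal" -ge "$replicas" ]; then
      echo "orphaned: $pvc"
    fi
  done
```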
Kubernetes 1.27 enabled StatefulSet PVC auto-deletion by default as a beta feature (it graduated to stable in Kubernetes 1.32), providing automated control over the PVC lifecycle relative to pods and the StatefulSet.
The persistentVolumeClaimRetentionPolicy:
This field controls when PVCs are automatically deleted:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: disposable-workers
spec:
  serviceName: workers
  replicas: 5
  # PVC retention policy - controls automatic PVC deletion
  persistentVolumeClaimRetentionPolicy:
    # What happens when the StatefulSet is deleted
    whenDeleted: Delete   # Options: Retain (default), Delete
    # What happens when the pod is scaled down
    whenScaled: Delete    # Options: Retain (default), Delete
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: worker:v1
        volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumeClaimTemplates:
  - metadata:
      name: scratch
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 50Gi
---
# Production database - preserve data on scale down and delete
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: production-db
spec:
  serviceName: production-db
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # Keep PVCs even if StatefulSet deleted
    whenScaled: Retain    # Keep PVCs even if pods scaled down
  # ... rest of spec
```

| Scenario | whenScaled | whenDeleted | Result |
|---|---|---|---|
| Preserve all data (default) | Retain | Retain | PVCs never auto-deleted, manual cleanup required |
| Clean up on scale down | Delete | Retain | Scaling down deletes PVCs; StatefulSet deletion preserves them |
| Clean up on deletion | Retain | Delete | PVCs kept during scaling, deleted with StatefulSet |
| Ephemeral storage | Delete | Delete | PVCs always auto-deleted when pods go away |
Use Delete policies for: workers with scratch storage, CI/CD runners with build caches, and ML training jobs whose checkpoints become irrelevant. Always use Retain for production databases, message queue state, and any data that needs backup before deletion.
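Unlike volumeClaimTemplates, the retention policy can be changed on an existing StatefulSet. A sketch, assuming a cluster where the feature is available (on by default since 1.27):

```bash
# Delete PVCs when the StatefulSet itself is deleted, but keep them on scale-down
kubectl patch statefulset disposable-workers --type merge \
  -p '{"spec":{"persistentVolumeClaimRetentionPolicy":{"whenDeleted":"Delete","whenScaled":"Retain"}}}'

# Verify the policy took effect
kubectl get statefulset disposable-workers \
  -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'
```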
The stable storage identity provided by StatefulSets enables powerful recovery patterns. Understanding these patterns is essential for designing highly available stateful systems.
Pod failure recovery:
When a StatefulSet pod fails (node crash, OOM kill, etc.), the controller creates a replacement pod with the same name and ordinal, possibly on a different node. The replacement re-attaches the same PVCs and starts with the existing data.
For block storage (EBS, GCE PD), this may require waiting for the volume to detach from the failed node—a process that can take several minutes if the node is unresponsive.
```yaml
# Scenario 1: Pod crash recovery
# Pod mysql-0 crashes on node-1
# Timeline:
# T+0: mysql-0 crashes (OOM, application error)
# T+1: kubelet detects, reports to API server
# T+2: StatefulSet controller creates replacement mysql-0
# T+3: Scheduler selects node-2 (node-1 may be viable again)
# T+5: PVC data-mysql-0 attached to mysql-0 on node-2
# T+6: mysql-0 starts with existing data
#
# Total recovery: ~10-30 seconds for healthy nodes

# Scenario 2: Node failure recovery
# node-1 fails (hardware, network partition)
# Timeline:
# T+0: node-1 becomes unresponsive
# T+5min: kubelet heartbeat fails, node marked NotReady
# T+10min: pod.spec.tolerations node.kubernetes.io/not-ready:NoExecute expires
# T+10min: Pod evicted from failed node
# T+10min: StatefulSet creates replacement pod
# T+12min: Volume detach timeout, force detach issued
# T+13min: PVC attached to new pod
# T+14min: Pod running with data
#
# Total recovery: ~10-15 minutes worst case

# Decrease recovery time with pod disruption budget and
# volume attachment tuning:
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-pdb
spec:
  minAvailable: 2  # Maintain quorum during disruptions
  selector:
    matchLabels:
      app: mysql
---
# Force detach timeout (CSI driver configuration example)
# Decrease time before force-detaching from failed nodes
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true
  podInfoOnMount: false
  # volumeLifecycleModes: ["Persistent", "Ephemeral"]
```

Manual recovery patterns:
Sometimes automated recovery isn't enough. Manual intervention patterns include:
- Force delete: `kubectl delete pod mysql-0 --force --grace-period=0` when pods won't terminate

Force deleting pods bypasses graceful shutdown. For databases, this can cause data corruption. Only use force delete when you're certain the original pod is unreachable and cannot perform writes. Combine with application-level checks (verify the primary is truly dead before promoting a replica).
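One way to gain that certainty before force-deleting is to confirm the node is unreachable and inspect its volume attachments; a sketch (the node name is a placeholder):

```bash
# Find the node that was running the stuck pod
kubectl get pod mysql-0 -o wide

# Confirm the node is genuinely gone (expect STATUS NotReady / unreachable)
kubectl get node <node-name>
kubectl describe node <node-name> | grep -A5 Conditions

# See whether the pod's volume is still attached to the dead node
kubectl get volumeattachments | grep <node-name>

# Only then force delete; the StatefulSet controller recreates mysql-0
kubectl delete pod mysql-0 --force --grace-period=0
```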
Production stateful applications often benefit from multiple volumes with different characteristics. VolumeClaimTemplates support this pattern natively.
Common multi-volume patterns:
| Pattern | Volumes | Rationale |
|---|---|---|
| Data + WAL | Primary data, write-ahead logs | Isolate sequential WAL writes from random data I/O |
| Data + Logs | Application data, application logs | Different retention, separate backup strategies |
| Hot + Cold | Fast SSD, cheap HDD | Tiered storage within same application |
| Data + Config | Persistent data, configuration files | Different update patterns, security considerations |
| Data + Temp | Persistent data, ephemeral scratch | Scratch space doesn't need persistence |
```yaml
# Elasticsearch with optimized multi-volume layout
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: elasticsearch:8.10.0
        env:
        - name: path.data
          value: /usr/share/elasticsearch/data
        - name: path.logs
          value: /var/log/elasticsearch
        volumeMounts:
        # Primary data volume - fast SSD
        - name: data
          mountPath: /usr/share/elasticsearch/data
        # Logs volume - standard storage, can tolerate loss
        - name: logs
          mountPath: /var/log/elasticsearch
        # Snapshots volume - cheaper storage for backups
        - name: snapshots
          mountPath: /snapshots
        resources:
          requests:
            memory: "4Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
            cpu: "2"
  volumeClaimTemplates:
  # Data: Fast SSD, highest performance tier
  - metadata:
      name: data
      labels:
        tier: premium
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi
  # Logs: Standard storage, acceptable to lose
  - metadata:
      name: logs
      labels:
        tier: standard
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 50Gi
  # Snapshots: Cold storage for backups
  - metadata:
      name: snapshots
      labels:
        tier: archive
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: cold-storage
      resources:
        requests:
          storage: 1000Gi
```

For temporary data that doesn't need persistence (caches, temp files, scratch space), use emptyDir volumes instead of volumeClaimTemplates. This avoids unnecessary PVC creation and storage costs.
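A minimal sketch of that pattern: a hypothetical StatefulSet that keeps durable data on a PVC while using an emptyDir for scratch space (name, image, and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache-worker            # hypothetical workload
spec:
  serviceName: cache-worker
  replicas: 3
  selector:
    matchLabels:
      app: cache-worker
  template:
    metadata:
      labels:
        app: cache-worker
    spec:
      containers:
      - name: worker
        image: redis:7
        volumeMounts:
        - name: data            # persistent, from volumeClaimTemplates
          mountPath: /data
        - name: scratch         # ephemeral, recreated empty with every pod
          mountPath: /tmp/scratch
      volumes:
      # emptyDir: no PVC is created, no storage cost beyond node-local disk
      - name: scratch
        emptyDir:
          sizeLimit: 10Gi
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 20Gi
```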
Running stateful applications in production requires attention to several operational concerns beyond basic StatefulSet configuration.
```yaml
# Production-ready StatefulSet configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: production-mysql
spec:
  serviceName: mysql
  replicas: 3
  # Conservative update strategy - manual control
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      # partition: 2  # Uncomment to update only pods >= ordinal 2
  # Preserve PVCs on all operations
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  # Minimum ready time before available
  minReadySeconds: 30
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      # Graceful termination time
      terminationGracePeriodSeconds: 120
      # Spread pods across nodes
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mysql
            topologyKey: kubernetes.io/hostname
      containers:
      - name: mysql
        image: mysql:8.0
        lifecycle:
          preStop:
            exec:
              # Graceful flush before shutdown
              command: ["/bin/sh", "-c", "mysqladmin shutdown -uroot -p$MYSQL_ROOT_PASSWORD"]
        readinessProbe:
          exec:
            command: ["mysqladmin", "ping", "-uroot", "-p$(MYSQL_ROOT_PASSWORD)"]
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 3
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping", "-uroot", "-p$(MYSQL_ROOT_PASSWORD)"]
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 5
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 200Gi
---
# PDB to maintain quorum
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mysql
```

StatefulSet storage provides the stable, persistent storage foundation that stateful applications require in Kubernetes. The combination of volumeClaimTemplates, ordered operations, and stable identity enables reliable database and stateful service deployments.
What's next:
We'll explore cloud provider integration for Kubernetes storage—how AWS EBS, Google Persistent Disk, Azure Disk, and associated CSI drivers integrate with Storage Classes and PVs. Understanding cloud-specific behavior is essential for production deployments.
You now understand StatefulSet storage comprehensively—from volumeClaimTemplates through ordered provisioning, scaling behaviors, PVC retention policies, recovery patterns, and production considerations. This knowledge enables reliable stateful application deployments in Kubernetes.