Not every application fits the stateless mold. Databases, message brokers, distributed caches, and consensus systems all share a common requirement: they need to be treated as individuals, not as interchangeable units in a pool.
Consider what happens when you run a PostgreSQL primary-replica setup. The primary instance must know it's the primary. Replicas need stable addresses to connect to the primary. Each instance requires its own persistent storage that survives pod restarts. Replacing a failed replica isn't as simple as spinning up a new, anonymous pod—it needs the same identity, the same network address, and access to the same data.
This is where StatefulSets enter the picture. They provide the guarantees that stateful applications demand while still leveraging Kubernetes' orchestration capabilities.
By the end of this page, you'll understand when and why to use StatefulSets instead of Deployments. You'll master the concepts of stable pod identities, headless services, persistent volume claims, and ordered operations. You'll also learn the operational patterns for running production databases and distributed systems on Kubernetes.
To understand why StatefulSets exist, we must first understand what Deployments cannot provide—even with persistent volumes attached.
Deployment limitations for stateful workloads:
Deployments treat all pods as identical and interchangeable. This creates several problems for stateful applications:
| Requirement | Deployment Behavior | Why It's a Problem |
|---|---|---|
| Stable network identity | Pod names are random (e.g., api-7d8cf4d9-xz4rw) | Other pods can't reliably connect; DNS names change on restart |
| Persistent storage per pod | All pods can mount same volume or get random PVCs | Data isn't associated with specific pod identity |
| Ordered startup | All pods start simultaneously | Leader election, data initialization may conflict |
| Ordered shutdown | Pods terminate in arbitrary order | Graceful cluster shutdown requires reverse order |
| Ordered updates | Pods update with configurable parallelism | Rolling updates may disrupt quorum in distributed systems |
Real-world example: A Kafka cluster gone wrong
Imagine running a 3-broker Kafka cluster using a Deployment:
- The brokers come up with random pod names such as kafka-abc123, kafka-def456, and kafka-ghi789.
- When kafka-abc123 fails, its replacement appears as kafka-jkl012, with a new name, a new DNS entry, and no guaranteed claim to the old broker's storage.
- Other brokers and clients still pointing at kafka-abc123 hit DNS resolution failures, and the cluster treats the replacement as an entirely new broker.

With a StatefulSet, the replacement pod would retain the original name kafka-0, kafka-1, or kafka-2, reconnect to its original persistent storage, and resume its role seamlessly.
A common misconception is that StatefulSets are simply "Deployments with persistent volumes." This misses the point. The key differentiator is stable identity—the persistent, predictable naming that allows distributed systems to maintain their coordination state across pod restarts and rescheduling.
StatefulSets provide specific guarantees that collectively enable reliable operation of stateful applications. Understanding these guarantees—and their limitations—is crucial for correct usage.
- Stable, unique pod names: Pods are named $(statefulset-name)-$(ordinal). Pod mysql-0 will always be mysql-0, even if it's rescheduled to a different node.
- Stable per-pod storage: Each pod gets its own PersistentVolumeClaim. When mysql-0 restarts, it reconnects to the same persistent volume.
- Ordered deployment, scaling, and updates: Pods are created sequentially (0, 1, 2, ...) and terminated in reverse order, one at a time by default.

Headless Services: The Key to Stable DNS
StatefulSets rely on a headless service (a service with clusterIP: None) to provide stable DNS names for each pod. Unlike regular services that load-balance across pods, headless services return the individual pod IPs, enabling direct addressing.
The DNS naming pattern for StatefulSet pods:
$(pod-name).$(headless-service-name).$(namespace).svc.cluster.local
For a StatefulSet named mysql with headless service mysql-headless in the databases namespace:
- mysql-0.mysql-headless.databases.svc.cluster.local
- mysql-1.mysql-headless.databases.svc.cluster.local
- mysql-2.mysql-headless.databases.svc.cluster.local

Let's examine a production-grade StatefulSet configuration for a distributed database. We'll break down each section to understand its purpose:
```yaml
# Headless Service - Required for stable network identity
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
  namespace: databases
  labels:
    app: mysql
spec:
  clusterIP: None  # This makes it a headless service
  ports:
  - port: 3306
    name: mysql
  - port: 33060
    name: mysqlx
  selector:
    app: mysql
---
# Regular Service - For client access with load balancing
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: databases
spec:
  ports:
  - port: 3306
    name: mysql
  selector:
    app: mysql
    role: primary  # Only routes to primary
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: databases
spec:
  serviceName: mysql-headless  # Links to headless service
  replicas: 3

  # === Pod Management Policy ===
  podManagementPolicy: OrderedReady  # Default: sequential startup
  # Alternative: Parallel - for faster scaling when order doesn't matter

  # === Update Strategy ===
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0  # Update all pods (set higher to canary)

  # === Selector ===
  selector:
    matchLabels:
      app: mysql

  # === Pod Template ===
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        prometheus.io/scrape: "true"
    spec:
      terminationGracePeriodSeconds: 120  # Give MySQL time to flush

      # === Init Container: Configure based on ordinal ===
      initContainers:
      - name: init-mysql
        image: mysql:8.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Extract ordinal from hostname (mysql-0 -> 0)
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          # Generate server-id from ordinal
          echo "[mysqld]" > /mnt/conf/server-id.cnf
          echo "server-id=$((100 + ordinal))" >> /mnt/conf/server-id.cnf
          # First pod is primary, others are replicas
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/primary.cnf /mnt/conf/
          else
            cp /mnt/config-map/replica.cnf /mnt/conf/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf
        - name: config-map
          mountPath: /mnt/config-map

      # === Main Container ===
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
          name: mysql
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secrets
              key: root-password
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - bash
            - "-c"
            - |
              mysql -h 127.0.0.1 -uroot -p$MYSQL_ROOT_PASSWORD -e "SELECT 1"
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 2
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
        - name: conf
          mountPath: /etc/mysql/conf.d

      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql-config

  # === Persistent Volume Claim Templates ===
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: premium-ssd
      resources:
        requests:
          storage: 100Gi
```

Key points in this configuration:

- podManagementPolicy: OrderedReady (default) ensures sequential operations. Use Parallel only when pods don't depend on each other's state.
- volumeClaimTemplates create one PVC per pod, named $(volumeClaimTemplate.name)-$(pod-name), e.g., data-mysql-0.

StatefulSets have a unique relationship with storage that differs fundamentally from Deployments. Understanding this relationship is critical for data safety and operational planning.
The PVC Retention Behavior:
This is one of the most important yet misunderstood aspects of StatefulSets:
PVCs are NOT deleted when pods are scaled down — This prevents accidental data loss but also means you must explicitly clean up unused PVCs.
PVCs are NOT deleted when the StatefulSet is deleted — Even kubectl delete sts mysql leaves all PVCs intact. This is a safety feature.
Scaling back up reuses existing PVCs — If you scale down from 3 to 2, then back to 3, the new mysql-2 pod gets the original data-mysql-2 PVC with all its data.
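A minimal sketch of this lifecycle, using the mysql example above (PVC names follow the $(volumeClaimTemplate.name)-$(pod-name) pattern, so the third pod's claim is data-mysql-2):

```bash
# Scale down from 3 replicas to 2 -- mysql-2 is removed, but its PVC is not
kubectl scale statefulset mysql -n databases --replicas=2
kubectl get pvc data-mysql-2 -n databases   # still shows STATUS: Bound

# Option A: scale back up -- the new mysql-2 reattaches to data-mysql-2 and its data
kubectl scale statefulset mysql -n databases --replicas=3

# Option B: the scale-down is permanent -- reclaim storage by deleting the PVC manually
# (verify backups first; this is irreversible once the underlying PV is released)
kubectl delete pvc data-mysql-2 -n databases
```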
After scaling down a StatefulSet, you must manually delete unused PVCs if you want to reclaim storage. Kubernetes 1.27+ enables the persistentVolumeClaimRetentionPolicy field (beta, on by default) to automate this, but both defaults remain Retain for safety. Always verify data is backed up before deleting PVCs.
```yaml
# Kubernetes 1.27+ PVC Retention Policy
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  # ...
  persistentVolumeClaimRetentionPolicy:
    # What happens when the StatefulSet is deleted
    whenDeleted: Retain  # Default: keep PVCs
    # Alternative: Delete - delete PVCs with the StatefulSet
    # What happens when the replica count is reduced
    whenScaled: Retain   # Default: keep PVCs of scaled-down pods
    # Alternative: Delete - delete PVCs immediately on scale-down
```

Recommended storage settings for StatefulSet volumes:

| Setting | Recommended Value | Notes |
|---|---|---|
| Access Mode | ReadWriteOnce (RWO) | Standard for single-pod attachment |
| Volume Binding | WaitForFirstConsumer | Ensures PV is provisioned in pod's zone |
| Expand | allowVolumeExpansion: true | Enables online storage growth |
| Reclaim Policy | Delete or Retain | Retain for production, Delete for dev |
| IOPS/Throughput | Match workload needs | Premium SSD for databases |
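To make these recommendations concrete, here is a sketch of a StorageClass that the mysql example's storageClassName: premium-ssd could point to. The provisioner and parameters below assume the AWS EBS CSI driver purely as an example; substitute your cluster's CSI driver and its documented parameters.

```yaml
# Hypothetical StorageClass matching the table above (provisioner/parameters are assumptions)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd                        # Referenced by the volumeClaimTemplates
provisioner: ebs.csi.aws.com               # Assumption: AWS EBS CSI driver
parameters:
  type: gp3                                # Assumption: SSD-backed volume type
volumeBindingMode: WaitForFirstConsumer    # Provision the PV in the pod's scheduled zone
allowVolumeExpansion: true                 # Allow online storage growth
reclaimPolicy: Retain                      # Keep the PV (and data) if the PVC is deleted
```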
The ordered nature of StatefulSet operations is crucial for distributed systems that require coordination. Let's examine how ordering works and when to modify it.
Understanding Update Partitions:
The partition field in the update strategy is a powerful tool for staged rollouts. When partition is set to N, pods with ordinal >= N are updated, while pods with ordinal < N remain on the old version.
```bash
# Scenario: 5-node Elasticsearch cluster (es-0 through es-4)
# Goal: Canary update on last 2 nodes first

# Step 1: Set partition to 3 (pods 3,4 will update; pods 0,1,2 won't)
kubectl patch statefulset es --type='json' -p='[
  {"op": "replace", "path": "/spec/updateStrategy/rollingUpdate/partition", "value": 3}
]'

# Step 2: Update the image (only es-3 and es-4 update)
kubectl set image statefulset/es elasticsearch=elasticsearch:8.11.1

# Step 3: Validate cluster health
kubectl exec es-0 -- curl -s localhost:9200/_cluster/health

# Step 4: Lower partition to update more nodes
kubectl patch statefulset es --type='json' -p='[
  {"op": "replace", "path": "/spec/updateStrategy/rollingUpdate/partition", "value": 0}
]'
# Now all nodes update in order: es-2, es-1, es-0
```

For production distributed systems, use partitions to implement a staged rollout: (1) Update the highest ordinal pod first, (2) Verify cluster health, (3) Gradually lower the partition, (4) Monitor cluster behavior at each stage. This gives you multiple opportunities to catch issues before affecting your primary/leader nodes (typically the lowest ordinals).
Running stateful applications in production requires patterns beyond basic StatefulSet configuration. Here are battle-tested approaches for common scenarios:
Leader Election with Pod Ordinal
Many applications use the StatefulSet ordinal for leader election:
The init container below extracts the ordinal from the pod's hostname and writes the resulting role to a shared config volume:
```yaml
initContainers:
- name: configure-role
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    ORDINAL=$(echo $HOSTNAME | grep -o '[0-9]*$')
    if [ "$ORDINAL" = "0" ]; then
      echo "primary" > /config/role
      echo "ROLE=primary" >> /config/env
    else
      echo "replica" > /config/role
      echo "ROLE=replica" >> /config/env
      echo "PRIMARY_HOST=db-0.db-headless" >> /config/env
    fi
  volumeMounts:
  - name: config
    mountPath: /config
```

For production databases, consider using Kubernetes Operators instead of manually managing StatefulSets. Operators like PostgreSQL Operator (Zalando/CrunchyData), MySQL Operator (Oracle/Percona), or Strimzi (Kafka) encode operational knowledge—backup, failover, scaling, upgrades—into automated controllers that reduce human error and operational burden.
StatefulSet issues often involve storage, ordering, or identity—dimensions that don't exist with Deployments. Here's a systematic debugging approach:
```bash
# 1. Check StatefulSet status and conditions
kubectl describe statefulset mysql
# Look for:
# - CurrentReplicas vs ReadyReplicas
# - UpdateRevision vs CurrentRevision (update progress)
# - Events indicating failures

# 2. Check pod ordering and state
kubectl get pods -l app=mysql -o wide
# Pods should be sequentially numbered
# Earlier pods should be Ready before later ones start

# 3. Verify PVC bindings
kubectl get pvc -l app=mysql
# Each PVC should be Bound to a PV
# PVC names should match pattern: data-mysql-N

# 4. Check PV provisioning
kubectl describe pvc data-mysql-0
# Look for: Events, StorageClass, VolumeMode

# 5. Debug ordering issues
kubectl get pods -l app=mysql -w
# Watch pods during scale up/down
# If pod-N starts before pod-N-1 is Ready, there's an issue

# 6. Verify headless service DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup mysql-headless
# Should return A records for each pod
```

Common symptoms and their fixes:

| Symptom | Likely Cause | Solution |
|---|---|---|
| Pods stuck in Pending | PVC cannot be provisioned (no PV, wrong zone) | Check StorageClass, CSI driver logs, PVC events |
| Scale-up blocked at pod N | Pod N-1 not Ready | Debug pod N-1's readiness probe, logs |
| DNS resolution fails for pod | Headless service misconfigured | Verify serviceName matches headless service name |
| Pod restarts with fresh data | PVC not correctly bound | Check volumeMounts, PVC name pattern |
| Update stuck on pod N | Partition set too high | Lower updateStrategy.partition value |
| Old pods not deleting during update | Pods repeatedly failing readiness | Check logs, fix application issues |
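For identity- or storage-specific symptoms in the table above, a quick per-pod check like this sketch (reusing the mysql example and databases namespace) confirms whether the stable DNS record and PVC binding are intact:

```bash
# Resolve a single pod's stable DNS name
# (pattern: pod-name.headless-service.namespace.svc.cluster.local)
kubectl run -it --rm dns-check --image=busybox --restart=Never -- \
  nslookup mysql-0.mysql-headless.databases.svc.cluster.local

# Confirm the pod's PVC follows the expected name pattern and is Bound
kubectl get pvc data-mysql-0 -n databases

# Check which claim the pod actually mounts
kubectl describe pod mysql-0 -n databases | grep ClaimName
```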
Let's consolidate the essential knowledge about StatefulSets:
- Stable identity is the core guarantee: each pod keeps a predictable name ($(statefulset-name)-$(ordinal)) and its own PVC across restarts and rescheduling.
- A headless service gives every pod a stable DNS name of the form pod-name.service-name.namespace.svc.cluster.local.
- Operations are ordered by default (OrderedReady); switch to Parallel when order doesn't matter.
- PVCs are retained on scale-down and StatefulSet deletion unless a persistentVolumeClaimRetentionPolicy says otherwise; clean up unused PVCs deliberately.
- For staged rollouts, set updateStrategy.partition to N to only update pods with ordinal >= N.

What's next:
Now that you understand StatefulSets for applications requiring stable identity and storage, we'll explore DaemonSets—the workload type for running exactly one pod on every (or selected) node. DaemonSets are essential for node-level concerns like logging agents, monitoring exporters, and network plugins.
You now have a comprehensive understanding of StatefulSets. You can configure stateful applications with stable identities, persistent storage, and ordered operations. You understand the storage lifecycle, partition-based updates, and operational patterns for distributed systems. Next, we'll explore DaemonSets for node-level workloads.