Kubernetes has become the de facto standard for container orchestration, transforming how organizations deploy and manage applications. But Kubernetes introduces its own complexity: pods, deployments, services, ingresses, operators, custom resources—layers upon layers of abstraction that create new failure modes traditional chaos tools weren't designed to address.
LitmusChaos emerged to bring chaos engineering natively into the Kubernetes ecosystem.
Developed by MayaData (now part of DataCore) and donated to the Cloud Native Computing Foundation (CNCF), LitmusChaos is built on Kubernetes primitives from the ground up. It speaks Kubernetes-native: Custom Resource Definitions (CRDs), operators, and declarative YAML—the language platform engineers already know.
By the end of this page, you will understand the LitmusChaos architecture and operator model, be able to define chaos experiments and workflows using CRDs, know how to leverage ChaosHub for community-sourced experiments, and be ready to integrate LitmusChaos into GitOps and CI/CD pipelines.
Traditional chaos engineering tools like Chaos Monkey were designed for virtual machines and bare-metal servers. While they can work in Kubernetes environments (by targeting nodes or pods as VMs), they miss the abstractions that make Kubernetes unique—and where many failures actually occur.
| Failure Mode | Traditional Chaos Approach | Kubernetes-Native Approach |
|---|---|---|
| Pod eviction | Terminate container process | Evict via Kubernetes API, test rescheduling |
| Service discovery failure | Block DNS | Delete Endpoints, test service mesh recovery |
| Resource quota exhaustion | Consume resources | Create competing pods, test scheduler behavior |
| Network policy misconfiguration | Block network | Validate NetworkPolicy failures |
| Persistent volume issues | Fill disk | Detach PV, test StatefulSet recovery |
| Node drain | Shutdown VM | kubectl drain, test pod migration |
The Kubernetes failure surface:
Kubernetes adds multiple abstraction layers, each with potential failure points:
- Layer 1, Application: container crashes, OOM kills, application errors, health check failures, readiness probe issues
- Layer 2, Pod: pod eviction, pod preemption, init container failures, sidecar issues, volume mount failures
- Layer 3, Workload controllers: Deployment rollout failures, ReplicaSet issues, StatefulSet ordering violations, DaemonSet gaps
- Layer 4, Services and networking: Service routing failures, Ingress misconfigurations, NetworkPolicy blocks, DNS resolution failures
- Layer 5, Cluster infrastructure: node failures, etcd issues, API server unavailability, scheduler failures, controller-manager issues
- Layer 6, Cloud provider: load balancer issues, PersistentVolume provisioning, cloud controller failures, CSI driver problems

A Kubernetes-native chaos tool can target Kubernetes abstractions directly (Deployments, Services, StatefulSets) rather than requiring users to translate Kubernetes concepts into infrastructure primitives. This reduces cognitive overhead and catches failures that only manifest at the Kubernetes layer.
LitmusChaos follows the Kubernetes operator pattern, using Custom Resource Definitions (CRDs) to represent chaos experiments as native Kubernetes objects. This architecture enables GitOps workflows, Kubernetes RBAC integration, and familiar kubectl-based operations.
The architecture splits into a control plane and per-cluster execution components:

- Litmus Portal (UI): experiment designer, workflow builder, analytics and observability
- Chaos Center (backend): GraphQL API, MongoDB for state, authentication and authorization
- Subscriber (cluster agent, runs inside each target cluster): connects the cluster to the Chaos Center, receives workflow execution requests, and reports experiment results back
- Chaos Operator (controller): watches ChaosEngine CRDs, orchestrates chaos experiments, and manages the experiment lifecycle
- Chaos Runner: executes individual experiments against the target
- Chaos Exporter: publishes Prometheus metrics about experiments
- Target application/infrastructure: Pods, Deployments, StatefulSets, Nodes, etc.

Custom Resource Definitions:
LitmusChaos defines several CRDs that represent the chaos primitives:
| CRD | Purpose | Lifecycle |
|---|---|---|
| ChaosExperiment | Defines a specific chaos experiment type (template) | Created once, reused across engines |
| ChaosEngine | Binds experiment to target application | Created per experiment run |
| ChaosResult | Records experiment outcome and observations | Created automatically by engine |
| ChaosSchedule | Enables scheduled/recurring experiments | Long-lived, triggers engines |
| Workflow | Argo-based multi-step chaos workflow | Created per scenario run, orchestrates engines |
By leveraging the Kubernetes operator pattern, LitmusChaos gains automatic reconciliation, declarative management, and native integration with Kubernetes RBAC, namespacing, and resource quotas. Experiments become just another Kubernetes resource to manage.
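The ChaosSchedule CRD from the table above is what makes chaos recurring rather than one-off. A minimal sketch, assuming the Litmus chaos-scheduler component is installed; the schedule name and target labels are illustrative, and the exact field names should be verified against your installed Litmus version:

```yaml
# Sketch: rerun the nginx pod-delete engine during working hours
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-pod-delete
  namespace: litmus
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval: "2h"              # At most one run every two hours
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri" # Skip weekends
  engineTemplateSpec:                        # Same shape as a ChaosEngine spec
    engineState: active
    appinfo:
      appns: production
      applabel: app=nginx
      appkind: deployment
    chaosServiceAccount: pod-delete-sa
    experiments:
      - name: pod-delete
```

The scheduler stamps out a fresh ChaosEngine from `engineTemplateSpec` on each trigger, so ChaosResults accumulate per run.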
LitmusChaos provides a rich library of pre-built chaos experiments targeting different layers of the Kubernetes stack. Understanding these experiments is key to designing effective resilience tests.
```yaml
# Example: Pod Delete Chaos Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    # Scope and permissions
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
      - apiGroups: ["apps"]
        resources: ["deployments", "replicasets"]
        verbs: ["get", "list"]
      - apiGroups: ["litmuschaos.io"]
        resources: ["chaosengines", "chaosexperiments", "chaosresults"]
        verbs: ["create", "get", "list", "patch", "update"]
    # Experiment image
    image: litmuschaos/go-runner:latest
    imagePullPolicy: Always
    # Arguments and environment
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: CHAOS_INTERVAL
        value: "10"
      - name: FORCE
        value: "false"
      - name: TARGET_PODS
        value: ""       # Empty means random selection
      - name: PODS_AFFECTED_PERC
        value: "50"     # Affect 50% of matching pods
      - name: RAMP_TIME
        value: "0"
      - name: SEQUENCE
        value: "parallel"
---
# ChaosEngine binds the experiment to a target application
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-pod-delete-chaos
  namespace: production
spec:
  # Enable the engine
  engineState: active
  # Target application
  appinfo:
    appns: production
    applabel: app=nginx
    appkind: deployment
  # Experiment selection
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            - name: FORCE
              value: "true"
            - name: PODS_AFFECTED_PERC
              value: "30"
        # Probes define success criteria
        probe:
          - name: healthcheck
            type: httpProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 1
              interval: 5
            httpProbe/inputs:
              url: http://nginx-service:80/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
  # Cleanup after experiment
  annotationCheck: "true"
  chaosServiceAccount: pod-delete-sa
```

Each chaos experiment requires specific Kubernetes RBAC permissions. The `chaosServiceAccount` specified in the ChaosEngine must have appropriate roles bound. This is a common source of experiment failures: always verify RBAC setup before troubleshooting experiment logic.
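The `pod-delete-sa` account referenced above needs a ServiceAccount plus a Role granting at least the verbs the experiment's `permissions` block declares. A minimal sketch (namespace and rule set are illustrative; the official ChaosHub ships a ready-made rbac.yaml per experiment that should be preferred):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]           # The runner launches experiment Jobs
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: production
```

If an experiment pod sits in `Error` with "forbidden" messages in its logs, a missing rule here is the usual culprit.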
ChaosHub is LitmusChaos's experiment repository—a marketplace of community-contributed and officially maintained chaos experiments. It dramatically accelerates chaos adoption by providing ready-to-use experiments for common failure scenarios.
| Category | Experiments Available | Use Cases |
|---|---|---|
| Generic | pod-delete, pod-cpu-hog, network-chaos | Universal Kubernetes testing |
| AWS | ec2-terminate, ebs-loss, az-chaos | AWS-specific infrastructure chaos |
| GCP | gcp-vm-disk-loss, gcp-vm-instance-stop | GCP-specific chaos |
| Azure | azure-disk-loss, azure-instance-stop | Azure-specific chaos |
| Kafka | kafka-broker-pod-failure, kafka-disk-failure | Kafka resilience testing |
| Cassandra | cassandra-pod-delete, cassandra-repair | Cassandra cluster testing |
Private ChaosHubs:
Organizations can create private ChaosHubs containing custom experiments tailored to their applications and failure modes. This enables sharing chaos knowledge across teams without exposing internal details publicly.
```yaml
# Example: Custom ChaosHub Experiment for Internal Payment Service
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: payment-service-latency
  namespace: litmus
  labels:
    chaoshub: internal
    app.kubernetes.io/category: payment
    app.kubernetes.io/domain: fintech
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods", "services"]
        verbs: ["get", "list"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["get", "list"]
      - apiGroups: ["litmuschaos.io"]
        resources: ["chaosengines", "chaosresults"]
        verbs: ["create", "get", "list", "patch", "update"]
    image: internal-registry/chaos-experiments:payment-v1.2.0
    # Custom experiment logic
    command:
      - /bin/bash
    args:
      - -c
      - |
        # Inject latency into payment service dependencies
        ./inject-latency \
          --target-service payment-gateway \
          --upstream-latency-ms ${UPSTREAM_LATENCY_MS} \
          --downstream-latency-ms ${DOWNSTREAM_LATENCY_MS} \
          --duration ${CHAOS_DURATION}
    env:
      # Payment-specific parameters
      - name: UPSTREAM_LATENCY_MS
        value: "500"
      - name: DOWNSTREAM_LATENCY_MS
        value: "200"
      - name: CHAOS_DURATION
        value: "120"
      # Safety parameters
      - name: ABORT_ON_ERROR_RATE
        value: "5"      # Abort if error rate exceeds 5%
      - name: MONITOR_INTERVAL
        value: "10"
      # Notification settings
      - name: SLACK_WEBHOOK
        valueFrom:
          secretKeyRef:
            name: chaos-secrets
            key: slack-webhook
    # Labels for discovery
    labels:
      experiment: payment-service-latency
      tier: critical
      owner: payments-team
    # ConfigMap for additional configuration
    configMaps:
      - name: payment-chaos-config
        mountPath: /etc/chaos
---
# Usage documentation embedded alongside the experiment
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-chaos-config
  namespace: litmus
data:
  README.md: |
    # Payment Service Latency Experiment

    ## Purpose
    Tests the payment service's behavior when upstream payment providers
    experience latency. Validates circuit breaker configuration and fallback
    to cached authorization responses.

    ## Prerequisites
    - Payment service deployed with circuit breaker enabled
    - Fallback cache warmed with test authorization data
    - Monitoring dashboard open during experiment

    ## Expected Behavior
    - Circuit breaker should open at 500ms sustained latency
    - Fallback cache should serve pre-authorized transactions
    - Error rate should not exceed 2% for cached operations

    ## Rollback Procedure
    1. Halt the ChaosEngine: kubectl delete chaosengine payment-latency-test
    2. All latency injection stops automatically
    3. Circuit breakers will close after the recovery period (30s default)
```

Custom ChaosHub experiments should include comprehensive documentation: purpose, prerequisites, expected behavior, and rollback procedures. This documentation becomes tribal knowledge that scales chaos practices across teams without requiring real-time expert involvement.
LitmusChaos workflows orchestrate multiple chaos experiments into coherent test scenarios. Built on Argo Workflows, they enable complex, multi-step chaos with conditional logic, parallel execution, and integrated observability.
```yaml
# Example: Comprehensive Service Resilience Workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: service-resilience-test
  namespace: litmus
spec:
  entrypoint: resilience-test-dag
  serviceAccountName: litmus-admin
  # Workflow-level arguments
  arguments:
    parameters:
      - name: target-namespace
        value: production
      - name: target-deployment
        value: api-gateway
      - name: chaos-duration
        value: "120"
  templates:
    # DAG orchestration template
    - name: resilience-test-dag
      dag:
        tasks:
          # Step 1: Pre-chaos validation
          - name: pre-chaos-health-check
            template: health-check
            arguments:
              parameters:
                - name: phase
                  value: pre-chaos
          # Step 2a: Pod chaos (in parallel with Step 2b)
          - name: pod-delete-chaos
            template: run-pod-delete
            depends: pre-chaos-health-check
          # Step 2b: Network chaos (in parallel with Step 2a)
          - name: network-latency-chaos
            template: run-network-latency
            depends: pre-chaos-health-check
          # Step 3: Wait for both chaos experiments
          - name: mid-chaos-health-check
            template: health-check
            depends: "pod-delete-chaos && network-latency-chaos"
            arguments:
              parameters:
                - name: phase
                  value: mid-chaos
          # Step 4: Node-level chaos (escalation)
          - name: node-cpu-chaos
            template: run-node-cpu-hog
            depends: mid-chaos-health-check
            # Only run if the mid-chaos check passed
            when: "{{tasks.mid-chaos-health-check.outputs.parameters.result}} == passed"
          # Step 5: Recovery validation
          - name: post-chaos-health-check
            template: health-check
            depends: node-cpu-chaos
            arguments:
              parameters:
                - name: phase
                  value: post-chaos
          # Step 6: Generate report
          - name: generate-report
            template: chaos-report
            depends: post-chaos-health-check
    # Health check template
    - name: health-check
      inputs:
        parameters:
          - name: phase
      container:
        image: internal-registry/chaos-tools:latest
        command: ["/bin/bash", "-c"]
        args:
          - |
            echo "Running health check: {{inputs.parameters.phase}}"
            # Check deployment availability
            AVAILABLE=$(kubectl get deployment \
              {{workflow.parameters.target-deployment}} \
              -n {{workflow.parameters.target-namespace}} \
              -o jsonpath='{.status.availableReplicas}')
            DESIRED=$(kubectl get deployment \
              {{workflow.parameters.target-deployment}} \
              -n {{workflow.parameters.target-namespace}} \
              -o jsonpath='{.spec.replicas}')
            # Check error rate from Prometheus
            ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~\"5..\",deployment=\"{{workflow.parameters.target-deployment}}\"}[5m])" \
              | jq -r '.data.result[0].value[1]')
            echo "Available: $AVAILABLE, Desired: $DESIRED, Error Rate: $ERROR_RATE"
            # Validation logic
            if [ "$AVAILABLE" -ge "$((DESIRED - 1))" ] && \
               [ "$(echo "$ERROR_RATE < 0.05" | bc -l)" -eq 1 ]; then
              echo "passed" > /tmp/result
            else
              echo "failed" > /tmp/result
            fi
      outputs:
        parameters:
          - name: result
            valueFrom:
              path: /tmp/result
    # Pod delete experiment template
    - name: run-pod-delete
      container:
        image: litmuschaos/litmus-operator:latest
        command:
          - /bin/bash
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: pod-delete-engine
              namespace: {{workflow.parameters.target-namespace}}
            spec:
              engineState: active
              appinfo:
                appns: {{workflow.parameters.target-namespace}}
                applabel: app={{workflow.parameters.target-deployment}}
                appkind: deployment
              experiments:
                - name: pod-delete
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{workflow.parameters.chaos-duration}}"
                        - name: PODS_AFFECTED_PERC
                          value: "30"
            EOF
            # Wait for experiment completion
            while true; do
              STATUS=$(kubectl get chaosengine pod-delete-engine \
                -n {{workflow.parameters.target-namespace}} \
                -o jsonpath='{.status.engineStatus}')
              if [ "$STATUS" == "completed" ]; then
                break
              fi
              sleep 10
            done
    # Network latency experiment template
    - name: run-network-latency
      container:
        image: litmuschaos/litmus-operator:latest
        command:
          - /bin/bash
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: network-latency-engine
              namespace: {{workflow.parameters.target-namespace}}
            spec:
              engineState: active
              appinfo:
                appns: {{workflow.parameters.target-namespace}}
                applabel: app={{workflow.parameters.target-deployment}}
                appkind: deployment
              experiments:
                - name: pod-network-latency
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{workflow.parameters.chaos-duration}}"
                        - name: NETWORK_LATENCY
                          value: "200"
                        - name: CONTAINER_RUNTIME
                          value: containerd
            EOF
            # Wait for completion
            while true; do
              STATUS=$(kubectl get chaosengine network-latency-engine \
                -n {{workflow.parameters.target-namespace}} \
                -o jsonpath='{.status.engineStatus}')
              if [ "$STATUS" == "completed" ]; then
                break
              fi
              sleep 10
            done
    # Additional templates for node-cpu-hog and report generation...
```

Workflows unlock testing scenarios impossible with single experiments: cascade failures, progressive stress testing, recovery validation, and conditional chaos based on system behavior during the experiment.
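One sharp edge in the workflow above: the `while true` loops poll forever if an engine never reaches `completed`. A bounded poll with a timeout is safer. Here is a sketch of that pattern in Python, where `fetch_status` is a hypothetical stand-in for the `kubectl get chaosengine ... -o jsonpath='{.status.engineStatus}'` call:

```python
import time

def wait_for_completion(fetch_status, timeout_s=600, interval_s=10):
    """Poll fetch_status() until it returns 'completed' or timeout_s elapses.

    fetch_status stands in for the kubectl jsonpath lookup in the workflow.
    Returns True on completion, False on timeout (so the caller can fail
    the workflow step instead of hanging).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_status() == "completed":
            return True
        time.sleep(interval_s)
    return False

# Simulated engine that completes on the third poll
statuses = iter(["initialized", "running", "completed"])
print(wait_for_completion(lambda: next(statuses), timeout_s=5, interval_s=0))  # True
```

The same guard translates directly back to bash with a retry counter around the `sleep 10` loop.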
LitmusChaos probes enable hypothesis-driven chaos engineering by validating system behavior during experiments. They transform chaos from 'break things and see what happens' into scientific experiments with measurable outcomes.
| Probe Type | Mechanism | Use Case |
|---|---|---|
| httpProbe | HTTP request to endpoint | Service health endpoints, API availability |
| cmdProbe | Execute shell command | Custom validation scripts, database queries |
| k8sProbe | Kubernetes API checks | Resource existence, status validation |
| promProbe | Prometheus query | Metric-based validation, SLO verification |
```yaml
# Example: ChaosEngine with Comprehensive Probes
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        probe:
          # HTTP Probe: Verify the service is responding
          - name: payment-health-probe
            type: httpProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 3
              interval: 2
              probePollingInterval: 1
              initialDelaySeconds: 3
            httpProbe/inputs:
              url: http://payment-service.production.svc:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
          # Prometheus Probe: Verify the error-rate SLO
          - name: error-rate-slo-probe
            type: promProbe
            mode: Edge   # Check at start and end
            runProperties:
              probeTimeout: 10
              retry: 2
              interval: 5
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="payment-service"}[5m])) * 100
              comparator:
                type: float
                criteria: <=
                value: "1.0"   # Error rate <= 1%
          # Kubernetes Probe: Verify the deployment still exists
          - name: replica-count-probe
            type: k8sProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 3
            k8sProbe/inputs:
              group: apps
              version: v1
              resource: deployments
              namespace: production
              fieldSelector: metadata.name=payment-service
              operation: present
          # Command Probe: Custom database connectivity check
          - name: database-connectivity-probe
            type: cmdProbe
            mode: OnChaos
            runProperties:
              probeTimeout: 30
              retry: 1
              interval: 10
            cmdProbe/inputs:
              command: |
                /bin/bash -c '
                # Check database connection from a test pod
                kubectl run db-test --rm -i --restart=Never \
                  --image=postgres:13 \
                  --namespace=production \
                  -- psql -h postgres -U app -c "SELECT 1" || exit 1
                echo "Database connectivity verified"
                '
              source: inline
              comparator:
                type: string
                criteria: contains
                value: "Database connectivity verified"
  # Annotation for Prometheus metrics
  monitoringEnabled: true
  # Cleanup policy
  chaosServiceAccount: litmus-admin
  annotationCheck: "false"
```

Observability integration:
LitmusChaos exports metrics to Prometheus, enabling integration with existing observability stacks:
```yaml
# Prometheus alerts for LitmusChaos events
groups:
  - name: litmus-chaos-alerts
    rules:
      # Alert when chaos experiments fail
      - alert: ChaosExperimentFailed
        expr: litmuschaos_experiment_verdict{verdict="Fail"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment failed: {{ $labels.chaosengine_name }}"
          description: |
            Experiment {{ $labels.experiment_name }} in namespace
            {{ $labels.namespace }} failed. This may indicate a resilience
            gap in {{ $labels.app_label }}.
          runbook: https://wiki.company.com/chaos/experiment-failures
      # Alert when probe failures occur during chaos
      - alert: ChaosProbeFailure
        expr: litmuschaos_probe_success_percentage < 100
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos probe failures detected"
          description: |
            Probe {{ $labels.probe_name }} had failures during chaos
            experiment {{ $labels.chaosengine_name }}.
            Success rate: {{ $value }}%
      # Alert on unusually high chaos activity
      - alert: HighChaosActivity
        expr: sum(increase(litmuschaos_experiment_total[1h])) > 10
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High chaos engineering activity detected"
          description: |
            More than 10 chaos experiments triggered in the last hour.
            Verify this is expected activity.
```

Probes encode your hypothesis: 'I expect error rate to stay below 1% during pod deletion.' If probes pass, your hypothesis is validated. If they fail, you've discovered a resilience gap. This transforms chaos from exploratory destruction into structured experimentation.
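The comparator semantics behind a promProbe (a sampled float checked against a criteria such as `<=`) are simple enough to express in a few lines. This is an illustrative reimplementation of the idea in Python, not Litmus's actual probe code:

```python
import operator

# Map Litmus-style comparator criteria to Python operators (illustrative subset)
CRITERIA = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def evaluate_prom_probe(observed: float, criteria: str, value: str) -> bool:
    """Return True if the observed metric satisfies the probe's hypothesis."""
    return CRITERIA[criteria](observed, float(value))

# Hypothesis from the engine above: error rate stays at or below 1%
print(evaluate_prom_probe(0.4, "<=", "1.0"))  # True: hypothesis holds
print(evaluate_prom_probe(3.2, "<=", "1.0"))  # False: resilience gap discovered
```

A failing comparison is exactly what flips a ChaosResult verdict and fires the `ChaosProbeFailure` alert pattern shown earlier.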
LitmusChaos's CRD-based architecture naturally fits GitOps workflows. Chaos experiments become versioned, reviewed, and deployed alongside application code.
Chaos-as-Code in practice:
```
# GitOps repository structure for Chaos-as-Code
chaos-experiments/
├── base/                        # Shared experiment definitions
│   ├── experiments/
│   │   ├── pod-delete.yaml
│   │   ├── network-latency.yaml
│   │   └── node-cpu-hog.yaml
│   └── kustomization.yaml
├── environments/
│   ├── staging/                 # Staging-specific configuration
│   │   ├── engines/
│   │   │   ├── api-gateway-chaos.yaml
│   │   │   └── payment-service-chaos.yaml
│   │   ├── schedules/
│   │   │   └── daily-resilience-test.yaml
│   │   └── kustomization.yaml
│   └── production/              # Production-specific configuration
│       ├── engines/
│       │   ├── api-gateway-chaos.yaml
│       │   └── payment-service-chaos.yaml
│       ├── schedules/
│       │   └── weekly-gameday.yaml
│       └── kustomization.yaml
├── workflows/                   # Complex multi-step workflows
│   ├── service-resilience-test.yaml
│   ├── database-failover-test.yaml
│   └── full-stack-chaos.yaml
└── README.md                    # Documentation and runbooks
```
```yaml
# Argo CD Application for Chaos Experiments
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments-production
  namespace: argocd
spec:
  project: chaos-engineering
  source:
    repoURL: https://github.com/company/chaos-experiments.git
    targetRevision: main
    path: environments/production
    # Kustomize for environment-specific patches
    kustomize:
      namePrefix: prod-
      commonLabels:
        environment: production
        managed-by: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      prune: true      # Remove experiments no longer in Git
      selfHeal: true   # Revert manual changes
    syncOptions:
      - CreateNamespace=true
    # Retry on transient failures
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# Argo CD Project with appropriate permissions
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: chaos-engineering
  namespace: argocd
spec:
  description: Chaos Engineering experiments and workflows
  # Limit which clusters can receive chaos
  destinations:
    - namespace: litmus
      server: https://kubernetes.default.svc
    - namespace: staging
      server: https://kubernetes.default.svc
    - namespace: production
      server: https://kubernetes.default.svc
  # Limit which repos can define chaos
  sourceRepos:
    - https://github.com/company/chaos-experiments.git
  # Cluster resources chaos can manage
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
    - group: litmuschaos.io
      kind: "*"
  # Namespace resources
  namespaceResourceWhitelist:
    - group: ""
      kind: ServiceAccount
    - group: rbac.authorization.k8s.io
      kind: Role
    - group: rbac.authorization.k8s.io
      kind: RoleBinding
    - group: litmuschaos.io
      kind: "*"
```

LitmusChaos brings chaos engineering natively into the Kubernetes ecosystem, speaking the language platform engineers already know: CRDs, operators, and declarative YAML.
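Keeping chaos in Git also enables policy checks before anything syncs to a cluster. A hedged sketch of one such CI guardrail: reject any ChaosEngine that carries no probes, so hypothesis-free chaos never reaches production. Manifests are shown inline as already-parsed dicts; in a real pipeline you would load them from the repo's YAML files:

```python
def engines_missing_probes(manifests):
    """Return names of ChaosEngine manifests where any experiment lacks probes."""
    offenders = []
    for doc in manifests:
        if doc.get("kind") != "ChaosEngine":
            continue  # Skip experiments, schedules, workflows, etc.
        for exp in doc.get("spec", {}).get("experiments", []):
            if not exp.get("spec", {}).get("probe"):
                offenders.append(doc["metadata"]["name"])
                break
    return offenders

# Two inline manifests: one with a probe, one without
good = {
    "kind": "ChaosEngine",
    "metadata": {"name": "api-gateway-chaos"},
    "spec": {"experiments": [
        {"name": "pod-delete", "spec": {"probe": [{"name": "healthcheck"}]}},
    ]},
}
bad = {
    "kind": "ChaosEngine",
    "metadata": {"name": "payment-service-chaos"},
    "spec": {"experiments": [{"name": "pod-delete", "spec": {}}]},
}
print(engines_missing_probes([good, bad]))  # ['payment-service-chaos']
```

Wired into the same pipeline that opens the Argo CD pull request, a non-empty offender list fails the build, mirroring how application code is linted before merge.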
When to choose LitmusChaos:
- Your workloads run on Kubernetes and you want chaos expressed as CRDs, managed with kubectl, RBAC, and namespaces like any other resource
- You want GitOps-managed, declarative chaos that is versioned and reviewed alongside application manifests
- You want ready-made experiments from ChaosHub rather than building injection tooling from scratch
- You need multi-step scenarios with probes, Argo-based workflows, and Prometheus-backed validation
What's next:
LitmusChaos excels in Kubernetes environments. But what about chaos specifically designed for cloud-native service meshes and advanced Kubernetes patterns? In the next page, we'll explore Chaos Mesh, another CNCF project focused on fine-grained chaos for complex Kubernetes deployments.
You now understand LitmusChaos's Kubernetes-native architecture, experiment types, ChaosHub ecosystem, workflows, probes, and GitOps integration. LitmusChaos brings chaos engineering into the cloud-native stack as a first-class citizen, enabling platform engineers to test resilience using the same tools and patterns they use for everything else.