Kubernetes has become the de facto standard for container orchestration, transforming how organizations deploy and manage applications. But Kubernetes introduces its own complexity: pods, deployments, services, ingresses, operators, custom resources—layers upon layers of abstraction that create new failure modes traditional chaos tools weren't designed to address.
LitmusChaos emerged to bring chaos engineering natively into the Kubernetes ecosystem.
Developed by MayaData (now part of DataCore) and donated to the Cloud Native Computing Foundation (CNCF), LitmusChaos is built on Kubernetes primitives from the ground up. It speaks Kubernetes-native: Custom Resource Definitions (CRDs), operators, and declarative YAML—the language platform engineers already know.
By the end of this page, you will understand the LitmusChaos architecture and operator model, be able to define chaos experiments and workflows using CRDs, know how to leverage ChaosHub for community-sourced experiments, and be ready to integrate LitmusChaos into GitOps and CI/CD pipelines.
Traditional chaos engineering tools like Chaos Monkey were designed for virtual machines and bare-metal servers. While they can work in Kubernetes environments (by targeting nodes or pods as VMs), they miss the abstractions that make Kubernetes unique—and where many failures actually occur.
| Failure Mode | Traditional Chaos Approach | Kubernetes-Native Approach |
|---|---|---|
| Pod eviction | Terminate container process | Evict via Kubernetes API, test rescheduling |
| Service discovery failure | Block DNS | Delete Endpoints, test service mesh recovery |
| Resource quota exhaustion | Consume resources | Create competing pods, test scheduler behavior |
| Network policy misconfiguration | Block network | Validate NetworkPolicy failures |
| Persistent volume issues | Fill disk | Detach PV, test StatefulSet recovery |
| Node drain | Shutdown VM | kubectl drain, test pod migration |
The Kubernetes failure surface:
Kubernetes adds multiple abstraction layers, each with potential failure points:
- Layer 1, Application: container crashes, OOM kills, application errors, health check failures, readiness probe issues
- Layer 2, Pod: pod eviction, pod preemption, init container failures, sidecar issues, volume mount failures
- Layer 3, Workload controllers: Deployment rollout failures, ReplicaSet issues, StatefulSet ordering violations, DaemonSet gaps
- Layer 4, Services and networking: Service routing failures, Ingress misconfigurations, NetworkPolicy blocks, DNS resolution failures
- Layer 5, Cluster infrastructure: node failures, etcd issues, API server unavailability, scheduler failures, controller-manager issues
- Layer 6, Cloud provider: load balancer issues, PersistentVolume provisioning, cloud controller failures, CSI driver problems

A Kubernetes-native chaos tool can target Kubernetes abstractions directly (Deployments, Services, StatefulSets) rather than requiring users to translate Kubernetes concepts into infrastructure primitives. This reduces cognitive overhead and catches failures that only manifest at the Kubernetes layer.
LitmusChaos follows the Kubernetes operator pattern, using Custom Resource Definitions (CRDs) to represent chaos experiments as native Kubernetes objects. This architecture enables GitOps workflows, Kubernetes RBAC integration, and familiar kubectl-based operations.
The architecture splits into a control plane and per-cluster execution components:

- Litmus Portal (UI): experiment designer, workflow builder, analytics and observability
- Chaos Center (backend): GraphQL API, MongoDB for state, authentication and authorization
- Subscriber (cluster agent, runs inside each target cluster): connects the cluster to the Chaos Center, receives workflow execution requests, and reports experiment results back
- Chaos Operator (controller): watches ChaosEngine CRDs, orchestrates chaos experiments, and manages the experiment lifecycle
- Chaos Runner: executes individual experiments against the target
- Chaos Exporter: publishes Prometheus metrics about experiments
- Target application/infrastructure: Pods, Deployments, StatefulSets, Nodes, etc.

Custom Resource Definitions:
LitmusChaos defines several CRDs that represent the chaos primitives:
| CRD | Purpose | Lifecycle |
|---|---|---|
| ChaosExperiment | Defines a specific chaos experiment type (template) | Created once, reused across engines |
| ChaosEngine | Binds experiment to target application | Created per experiment run |
| ChaosResult | Records experiment outcome and observations | Created automatically by engine |
| ChaosSchedule | Enables scheduled/recurring experiments | Long-lived, triggers engines |
| Workflow | Argo-based multi-step chaos workflow | Created per scenario run, orchestrates engines |
By leveraging the Kubernetes operator pattern, LitmusChaos gains automatic reconciliation, declarative management, and native integration with Kubernetes RBAC, namespacing, and resource quotas. Experiments become just another Kubernetes resource to manage.
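The ChaosSchedule CRD from the table above is what makes chaos recurring rather than one-off. A minimal sketch, assuming the Litmus chaos-scheduler component is installed; the schedule name and target labels are illustrative, and the exact field names should be verified against your installed Litmus version:

```yaml
# Sketch: rerun the nginx pod-delete engine during working hours
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-pod-delete
  namespace: litmus
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval: "2h"              # At most one run every two hours
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri" # Skip weekends
  engineTemplateSpec:                        # Same shape as a ChaosEngine spec
    engineState: active
    appinfo:
      appns: production
      applabel: app=nginx
      appkind: deployment
    chaosServiceAccount: pod-delete-sa
    experiments:
      - name: pod-delete
```

The scheduler stamps out a fresh ChaosEngine from `engineTemplateSpec` on each trigger, so ChaosResults accumulate per run.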
LitmusChaos provides a rich library of pre-built chaos experiments targeting different layers of the Kubernetes stack. Understanding these experiments is key to designing effective resilience tests.
```yaml
# Example: Pod Delete Chaos Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: litmus
spec:
  definition:
    # Scope and permissions
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
      - apiGroups: ["apps"]
        resources: ["deployments", "replicasets"]
        verbs: ["get", "list"]
      - apiGroups: ["litmuschaos.io"]
        resources: ["chaosengines", "chaosexperiments", "chaosresults"]
        verbs: ["create", "get", "list", "patch", "update"]
    # Experiment image
    image: litmuschaos/go-runner:latest
    imagePullPolicy: Always
    # Arguments and environment
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: CHAOS_INTERVAL
        value: "10"
      - name: FORCE
        value: "false"
      - name: TARGET_PODS
        value: ""       # Empty means random selection
      - name: PODS_AFFECTED_PERC
        value: "50"     # Affect 50% of matching pods
      - name: RAMP_TIME
        value: "0"
      - name: SEQUENCE
        value: "parallel"
---
# ChaosEngine binds the experiment to a target application
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-pod-delete-chaos
  namespace: production
spec:
  # Enable the engine
  engineState: active
  # Target application
  appinfo:
    appns: production
    applabel: app=nginx
    appkind: deployment
  # Experiment selection
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            - name: FORCE
              value: "true"
            - name: PODS_AFFECTED_PERC
              value: "30"
        # Probes define success criteria
        probe:
          - name: healthcheck
            type: httpProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 1
              interval: 5
            httpProbe/inputs:
              url: http://nginx-service:80/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
  # Cleanup after experiment
  annotationCheck: "true"
  chaosServiceAccount: pod-delete-sa
```

Each chaos experiment requires specific Kubernetes RBAC permissions. The `chaosServiceAccount` specified in the ChaosEngine must have appropriate roles bound. This is a common source of experiment failures: always verify RBAC setup before troubleshooting experiment logic.
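The `pod-delete-sa` account referenced above needs a ServiceAccount plus a Role granting at least the verbs the experiment's `permissions` block declares. A minimal sketch (namespace and rule set are illustrative; the official ChaosHub ships a ready-made rbac.yaml per experiment that should be preferred):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]           # The runner launches experiment Jobs
    verbs: ["create", "get", "list", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: production
```

If an experiment pod sits in `Error` with "forbidden" messages in its logs, a missing rule here is the usual culprit.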
ChaosHub is LitmusChaos's experiment repository—a marketplace of community-contributed and officially maintained chaos experiments. It dramatically accelerates chaos adoption by providing ready-to-use experiments for common failure scenarios.
| Category | Experiments Available | Use Cases |
|---|---|---|
| Generic | pod-delete, pod-cpu-hog, network-chaos | Universal Kubernetes testing |
| AWS | ec2-terminate, ebs-loss, az-chaos | AWS-specific infrastructure chaos |
| GCP | gcp-vm-disk-loss, gcp-vm-instance-stop | GCP-specific chaos |
| Azure | azure-disk-loss, azure-instance-stop | Azure-specific chaos |
| Kafka | kafka-broker-pod-failure, kafka-disk-failure | Kafka resilience testing |
| Cassandra | cassandra-pod-delete, cassandra-repair | Cassandra cluster testing |
Private ChaosHubs:
Organizations can create private ChaosHubs containing custom experiments tailored to their applications and failure modes. This enables sharing chaos knowledge across teams without exposing internal details publicly.
```yaml
# Example: Custom ChaosHub Experiment for Internal Payment Service
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: payment-service-latency
  namespace: litmus
  labels:
    chaoshub: internal
    app.kubernetes.io/category: payment
    app.kubernetes.io/domain: fintech
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods", "services"]
        verbs: ["get", "list"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["get", "list"]
      - apiGroups: ["litmuschaos.io"]
        resources: ["chaosengines", "chaosresults"]
        verbs: ["create", "get", "list", "patch", "update"]
    image: internal-registry/chaos-experiments:payment-v1.2.0
    # Custom experiment logic
    command:
      - /bin/bash
    args:
      - -c
      - |
        # Inject latency into payment service dependencies
        ./inject-latency \
          --target-service payment-gateway \
          --upstream-latency-ms ${UPSTREAM_LATENCY_MS} \
          --downstream-latency-ms ${DOWNSTREAM_LATENCY_MS} \
          --duration ${CHAOS_DURATION}
    env:
      # Payment-specific parameters
      - name: UPSTREAM_LATENCY_MS
        value: "500"
      - name: DOWNSTREAM_LATENCY_MS
        value: "200"
      - name: CHAOS_DURATION
        value: "120"
      # Safety parameters
      - name: ABORT_ON_ERROR_RATE
        value: "5"      # Abort if error rate exceeds 5%
      - name: MONITOR_INTERVAL
        value: "10"
      # Notification settings
      - name: SLACK_WEBHOOK
        valueFrom:
          secretKeyRef:
            name: chaos-secrets
            key: slack-webhook
    # Labels for discovery
    labels:
      experiment: payment-service-latency
      tier: critical
      owner: payments-team
    # ConfigMap for additional configuration
    configMaps:
      - name: payment-chaos-config
        mountPath: /etc/chaos
---
# Usage documentation embedded alongside the experiment
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-chaos-config
  namespace: litmus
data:
  README.md: |
    # Payment Service Latency Experiment

    ## Purpose
    Tests the payment service's behavior when upstream payment providers
    experience latency. Validates circuit breaker configuration and fallback
    to cached authorization responses.

    ## Prerequisites
    - Payment service deployed with circuit breaker enabled
    - Fallback cache warmed with test authorization data
    - Monitoring dashboard open during experiment

    ## Expected Behavior
    - Circuit breaker should open at 500ms sustained latency
    - Fallback cache should serve pre-authorized transactions
    - Error rate should not exceed 2% for cached operations

    ## Rollback Procedure
    1. Halt the ChaosEngine: kubectl delete chaosengine payment-latency-test
    2. All latency injection stops automatically
    3. Circuit breakers will close after the recovery period (30s default)
```

Custom ChaosHub experiments should include comprehensive documentation: purpose, prerequisites, expected behavior, and rollback procedures. This documentation becomes tribal knowledge that scales chaos practices across teams without requiring real-time expert involvement.
LitmusChaos workflows orchestrate multiple chaos experiments into coherent test scenarios. Built on Argo Workflows, they enable complex, multi-step chaos with conditional logic, parallel execution, and integrated observability.
```yaml
# Example: Comprehensive Service Resilience Workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: service-resilience-test
  namespace: litmus
spec:
  entrypoint: resilience-test-dag
  serviceAccountName: litmus-admin
  # Workflow-level arguments
  arguments:
    parameters:
      - name: target-namespace
        value: production
      - name: target-deployment
        value: api-gateway
      - name: chaos-duration
        value: "120"
  templates:
    # DAG orchestration template
    - name: resilience-test-dag
      dag:
        tasks:
          # Step 1: Pre-chaos validation
          - name: pre-chaos-health-check
            template: health-check
            arguments:
              parameters:
                - name: phase
                  value: pre-chaos
          # Step 2a: Pod chaos (in parallel with Step 2b)
          - name: pod-delete-chaos
            template: run-pod-delete
            depends: pre-chaos-health-check
          # Step 2b: Network chaos (in parallel with Step 2a)
          - name: network-latency-chaos
            template: run-network-latency
            depends: pre-chaos-health-check
          # Step 3: Wait for both chaos experiments
          - name: mid-chaos-health-check
            template: health-check
            depends: "pod-delete-chaos && network-latency-chaos"
            arguments:
              parameters:
                - name: phase
                  value: mid-chaos
          # Step 4: Node-level chaos (escalation)
          - name: node-cpu-chaos
            template: run-node-cpu-hog
            depends: mid-chaos-health-check
            # Only run if the mid-chaos check passed
            when: "{{tasks.mid-chaos-health-check.outputs.parameters.result}} == passed"
          # Step 5: Recovery validation
          - name: post-chaos-health-check
            template: health-check
            depends: node-cpu-chaos
            arguments:
              parameters:
                - name: phase
                  value: post-chaos
          # Step 6: Generate report
          - name: generate-report
            template: chaos-report
            depends: post-chaos-health-check
    # Health check template
    - name: health-check
      inputs:
        parameters:
          - name: phase
      container:
        image: internal-registry/chaos-tools:latest
        command: ["/bin/bash", "-c"]
        args:
          - |
            echo "Running health check: {{inputs.parameters.phase}}"
            # Check deployment availability
            AVAILABLE=$(kubectl get deployment \
              {{workflow.parameters.target-deployment}} \
              -n {{workflow.parameters.target-namespace}} \
              -o jsonpath='{.status.availableReplicas}')
            DESIRED=$(kubectl get deployment \
              {{workflow.parameters.target-deployment}} \
              -n {{workflow.parameters.target-namespace}} \
              -o jsonpath='{.spec.replicas}')
            # Check error rate from Prometheus
            ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~\"5..\",deployment=\"{{workflow.parameters.target-deployment}}\"}[5m])" \
              | jq -r '.data.result[0].value[1]')
            echo "Available: $AVAILABLE, Desired: $DESIRED, Error Rate: $ERROR_RATE"
            # Validation logic
            if [ "$AVAILABLE" -ge "$((DESIRED - 1))" ] && \
               [ "$(echo "$ERROR_RATE < 0.05" | bc -l)" -eq 1 ]; then
              echo "passed" > /tmp/result
            else
              echo "failed" > /tmp/result
            fi
      outputs:
        parameters:
          - name: result
            valueFrom:
              path: /tmp/result
    # Pod delete experiment template
    - name: run-pod-delete
      container:
        image: litmuschaos/litmus-operator:latest
        command:
          - /bin/bash
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: pod-delete-engine
              namespace: {{workflow.parameters.target-namespace}}
            spec:
              engineState: active
              appinfo:
                appns: {{workflow.parameters.target-namespace}}
                applabel: app={{workflow.parameters.target-deployment}}
                appkind: deployment
              experiments:
                - name: pod-delete
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{workflow.parameters.chaos-duration}}"
                        - name: PODS_AFFECTED_PERC
                          value: "30"
            EOF
            # Wait for experiment completion
            while true; do
              STATUS=$(kubectl get chaosengine pod-delete-engine \
                -n {{workflow.parameters.target-namespace}} \
                -o jsonpath='{.status.engineStatus}')
              if [ "$STATUS" == "completed" ]; then
                break
              fi
              sleep 10
            done
    # Network latency experiment template
    - name: run-network-latency
      container:
        image: litmuschaos/litmus-operator:latest
        command:
          - /bin/bash
        args:
          - -c
          - |
            kubectl apply -f - <<EOF
            apiVersion: litmuschaos.io/v1alpha1
            kind: ChaosEngine
            metadata:
              name: network-latency-engine
              namespace: {{workflow.parameters.target-namespace}}
            spec:
              engineState: active
              appinfo:
                appns: {{workflow.parameters.target-namespace}}
                applabel: app={{workflow.parameters.target-deployment}}
                appkind: deployment
              experiments:
                - name: pod-network-latency
                  spec:
                    components:
                      env:
                        - name: TOTAL_CHAOS_DURATION
                          value: "{{workflow.parameters.chaos-duration}}"
                        - name: NETWORK_LATENCY
                          value: "200"
                        - name: CONTAINER_RUNTIME
                          value: containerd
            EOF
            # Wait for completion
            while true; do
              STATUS=$(kubectl get chaosengine network-latency-engine \
                -n {{workflow.parameters.target-namespace}} \
                -o jsonpath='{.status.engineStatus}')
              if [ "$STATUS" == "completed" ]; then
                break
              fi
              sleep 10
            done
    # Additional templates for node-cpu-hog and report generation...
```

Workflows unlock testing scenarios impossible with single experiments: cascade failures, progressive stress testing, recovery validation, and conditional chaos based on system behavior during the experiment.
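One sharp edge in the workflow above: the `while true` loops poll forever if an engine never reaches `completed`. A bounded poll with a timeout is safer. Here is a sketch of that pattern in Python, where `fetch_status` is a hypothetical stand-in for the `kubectl get chaosengine ... -o jsonpath='{.status.engineStatus}'` call:

```python
import time

def wait_for_completion(fetch_status, timeout_s=600, interval_s=10):
    """Poll fetch_status() until it returns 'completed' or timeout_s elapses.

    fetch_status stands in for the kubectl jsonpath lookup in the workflow.
    Returns True on completion, False on timeout (so the caller can fail
    the workflow step instead of hanging).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_status() == "completed":
            return True
        time.sleep(interval_s)
    return False

# Simulated engine that completes on the third poll
statuses = iter(["initialized", "running", "completed"])
print(wait_for_completion(lambda: next(statuses), timeout_s=5, interval_s=0))  # True
```

The same guard translates directly back to bash with a retry counter around the `sleep 10` loop.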
LitmusChaos probes enable hypothesis-driven chaos engineering by validating system behavior during experiments. They transform chaos from 'break things and see what happens' into scientific experiments with measurable outcomes.
| Probe Type | Mechanism | Use Case |
|---|---|---|
| httpProbe | HTTP request to endpoint | Service health endpoints, API availability |
| cmdProbe | Execute shell command | Custom validation scripts, database queries |
| k8sProbe | Kubernetes API checks | Resource existence, status validation |
| promProbe | Prometheus query | Metric-based validation, SLO verification |
```yaml
# Example: ChaosEngine with Comprehensive Probes
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        probe:
          # HTTP Probe: Verify the service is responding
          - name: payment-health-probe
            type: httpProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 3
              interval: 2
              probePollingInterval: 1
              initialDelaySeconds: 3
            httpProbe/inputs:
              url: http://payment-service.production.svc:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
          # Prometheus Probe: Verify the error-rate SLO
          - name: error-rate-slo-probe
            type: promProbe
            mode: Edge   # Check at start and end
            runProperties:
              probeTimeout: 10
              retry: 2
              interval: 5
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total{service="payment-service"}[5m])) * 100
              comparator:
                type: float
                criteria: <=
                value: "1.0"   # Error rate <= 1%
          # Kubernetes Probe: Verify the deployment still exists
          - name: replica-count-probe
            type: k8sProbe
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 3
            k8sProbe/inputs:
              group: apps
              version: v1
              resource: deployments
              namespace: production
              fieldSelector: metadata.name=payment-service
              operation: present
          # Command Probe: Custom database connectivity check
          - name: database-connectivity-probe
            type: cmdProbe
            mode: OnChaos
            runProperties:
              probeTimeout: 30
              retry: 1
              interval: 10
            cmdProbe/inputs:
              command: |
                /bin/bash -c '
                # Check database connection from a test pod
                kubectl run db-test --rm -i --restart=Never \
                  --image=postgres:13 \
                  --namespace=production \
                  -- psql -h postgres -U app -c "SELECT 1" || exit 1
                echo "Database connectivity verified"
                '
              source: inline
              comparator:
                type: string
                criteria: contains
                value: "Database connectivity verified"
  # Annotation for Prometheus metrics
  monitoringEnabled: true
  # Cleanup policy
  chaosServiceAccount: litmus-admin
  annotationCheck: "false"
```

Observability integration:
LitmusChaos exports metrics to Prometheus, enabling integration with existing observability stacks:
```yaml
# Prometheus alerts for LitmusChaos events
groups:
  - name: litmus-chaos-alerts
    rules:
      # Alert when chaos experiments fail
      - alert: ChaosExperimentFailed
        expr: litmuschaos_experiment_verdict{verdict="Fail"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment failed: {{ $labels.chaosengine_name }}"
          description: |
            Experiment {{ $labels.experiment_name }} in namespace
            {{ $labels.namespace }} failed. This may indicate a resilience
            gap in {{ $labels.app_label }}.
          runbook: https://wiki.company.com/chaos/experiment-failures
      # Alert when probe failures occur during chaos
      - alert: ChaosProbeFailure
        expr: litmuschaos_probe_success_percentage < 100
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos probe failures detected"
          description: |
            Probe {{ $labels.probe_name }} had failures during chaos
            experiment {{ $labels.chaosengine_name }}.
            Success rate: {{ $value }}%
      # Alert on unusually high chaos activity
      - alert: HighChaosActivity
        expr: sum(increase(litmuschaos_experiment_total[1h])) > 10
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High chaos engineering activity detected"
          description: |
            More than 10 chaos experiments triggered in the last hour.
            Verify this is expected activity.
```

Probes encode your hypothesis: 'I expect error rate to stay below 1% during pod deletion.' If probes pass, your hypothesis is validated. If they fail, you've discovered a resilience gap. This transforms chaos from exploratory destruction into structured experimentation.
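The comparator semantics behind a promProbe (a sampled float checked against a criteria such as `<=`) are simple enough to express in a few lines. This is an illustrative reimplementation of the idea in Python, not Litmus's actual probe code:

```python
import operator

# Map Litmus-style comparator criteria to Python operators (illustrative subset)
CRITERIA = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def evaluate_prom_probe(observed: float, criteria: str, value: str) -> bool:
    """Return True if the observed metric satisfies the probe's hypothesis."""
    return CRITERIA[criteria](observed, float(value))

# Hypothesis from the engine above: error rate stays at or below 1%
print(evaluate_prom_probe(0.4, "<=", "1.0"))  # True: hypothesis holds
print(evaluate_prom_probe(3.2, "<=", "1.0"))  # False: resilience gap discovered
```

A failing comparison is exactly what flips a ChaosResult verdict and fires the `ChaosProbeFailure` alert pattern shown earlier.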
LitmusChaos's CRD-based architecture naturally fits GitOps workflows. Chaos experiments become versioned, reviewed, and deployed alongside application code.
Chaos-as-Code in practice:
```
# GitOps repository structure for Chaos-as-Code
chaos-experiments/
├── base/                        # Shared experiment definitions
│   ├── experiments/
│   │   ├── pod-delete.yaml
│   │   ├── network-latency.yaml
│   │   └── node-cpu-hog.yaml
│   └── kustomization.yaml
├── environments/
│   ├── staging/                 # Staging-specific configuration
│   │   ├── engines/
│   │   │   ├── api-gateway-chaos.yaml
│   │   │   └── payment-service-chaos.yaml
│   │   ├── schedules/
│   │   │   └── daily-resilience-test.yaml
│   │   └── kustomization.yaml
│   └── production/              # Production-specific configuration
│       ├── engines/
│       │   ├── api-gateway-chaos.yaml
│       │   └── payment-service-chaos.yaml
│       ├── schedules/
│       │   └── weekly-gameday.yaml
│       └── kustomization.yaml
├── workflows/                   # Complex multi-step workflows
│   ├── service-resilience-test.yaml
│   ├── database-failover-test.yaml
│   └── full-stack-chaos.yaml
└── README.md                    # Documentation and runbooks
```
```yaml
# Argo CD Application for Chaos Experiments
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments-production
  namespace: argocd
spec:
  project: chaos-engineering
  source:
    repoURL: https://github.com/company/chaos-experiments.git
    targetRevision: main
    path: environments/production
    # Kustomize for environment-specific patches
    kustomize:
      namePrefix: prod-
      commonLabels:
        environment: production
        managed-by: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      prune: true      # Remove experiments no longer in Git
      selfHeal: true   # Revert manual changes
    syncOptions:
      - CreateNamespace=true
    # Retry on transient failures
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# Argo CD Project with appropriate permissions
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: chaos-engineering
  namespace: argocd
spec:
  description: Chaos Engineering experiments and workflows
  # Limit which clusters can receive chaos
  destinations:
    - namespace: litmus
      server: https://kubernetes.default.svc
    - namespace: staging
      server: https://kubernetes.default.svc
    - namespace: production
      server: https://kubernetes.default.svc
  # Limit which repos can define chaos
  sourceRepos:
    - https://github.com/company/chaos-experiments.git
  # Cluster resources chaos can manage
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
    - group: litmuschaos.io
      kind: "*"
  # Namespace resources
  namespaceResourceWhitelist:
    - group: ""
      kind: ServiceAccount
    - group: rbac.authorization.k8s.io
      kind: Role
    - group: rbac.authorization.k8s.io
      kind: RoleBinding
    - group: litmuschaos.io
      kind: "*"
```

LitmusChaos brings chaos engineering natively into the Kubernetes ecosystem, speaking the language platform engineers already know: CRDs, operators, and declarative YAML.
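Keeping chaos in Git also enables policy checks before anything syncs to a cluster. A hedged sketch of one such CI guardrail: reject any ChaosEngine that carries no probes, so hypothesis-free chaos never reaches production. Manifests are shown inline as already-parsed dicts; in a real pipeline you would load them from the repo's YAML files:

```python
def engines_missing_probes(manifests):
    """Return names of ChaosEngine manifests where any experiment lacks probes."""
    offenders = []
    for doc in manifests:
        if doc.get("kind") != "ChaosEngine":
            continue  # Skip experiments, schedules, workflows, etc.
        for exp in doc.get("spec", {}).get("experiments", []):
            if not exp.get("spec", {}).get("probe"):
                offenders.append(doc["metadata"]["name"])
                break
    return offenders

# Two inline manifests: one with a probe, one without
good = {
    "kind": "ChaosEngine",
    "metadata": {"name": "api-gateway-chaos"},
    "spec": {"experiments": [
        {"name": "pod-delete", "spec": {"probe": [{"name": "healthcheck"}]}},
    ]},
}
bad = {
    "kind": "ChaosEngine",
    "metadata": {"name": "payment-service-chaos"},
    "spec": {"experiments": [{"name": "pod-delete", "spec": {}}]},
}
print(engines_missing_probes([good, bad]))  # ['payment-service-chaos']
```

Wired into the same pipeline that opens the Argo CD pull request, a non-empty offender list fails the build, mirroring how application code is linted before merge.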
When to choose LitmusChaos:
- Your workloads run on Kubernetes and you want chaos expressed as CRDs, managed with kubectl, RBAC, and namespaces like any other resource
- You want GitOps-managed, declarative chaos that is versioned and reviewed alongside application manifests
- You want ready-made experiments from ChaosHub rather than building injection tooling from scratch
- You need multi-step scenarios with probes, Argo-based workflows, and Prometheus-backed validation
What's next:
LitmusChaos excels in Kubernetes environments. But what about chaos specifically designed for cloud-native service meshes and advanced Kubernetes patterns? In the next page, we'll explore Chaos Mesh, another CNCF project focused on fine-grained chaos for complex Kubernetes deployments.
You now understand LitmusChaos's Kubernetes-native architecture, experiment types, ChaosHub ecosystem, workflows, probes, and GitOps integration. LitmusChaos brings chaos engineering into the cloud-native stack as a first-class citizen, enabling platform engineers to test resilience using the same tools and patterns they use for everything else.