As Kubernetes deployments grow in complexity—with service meshes, operators, custom controllers, and multi-tenant workloads—the chaos engineering requirements become equally sophisticated. You need more than pod deletion; you need the ability to inject precise failures at specific layers of the stack, at exact moments, with surgical targeting.
Chaos Mesh was built for exactly this level of precision.
Developed by PingCAP (the company behind TiDB, a distributed database) and now a CNCF incubating project, Chaos Mesh represents the next evolution of Kubernetes-native chaos. Born from the need to test a complex distributed database system, it brings fine-grained chaos capabilities that go beyond what simpler tools can achieve.
By the end of this page, you will understand Chaos Mesh's architecture and privilege model, master its unique chaos types including JVM chaos and kernel-level injection, learn to design precision chaos experiments with complex selectors, and integrate Chaos Mesh into production workflows with appropriate safety controls.
Chaos Mesh emerged from PingCAP's need to test TiDB, a distributed NewSQL database. TiDB's architecture includes multiple components—TiDB servers, TiKV storage nodes, PD (Placement Driver) for cluster management—each with complex failure modes that simple chaos tools couldn't adequately test.
| Requirement | Why Traditional Tools Fell Short | Chaos Mesh Solution |
|---|---|---|
| Time-sensitive failures | Random timing insufficient for race conditions | Precise scheduling with time-based injection |
| Kernel-level injection | User-space injection doesn't test kernel paths | eBPF-based kernel chaos capabilities |
| JVM-specific failures | Generic process disruption too coarse | Bytecode injection for JVM applications |
| Multi-component scenarios | Single-target focus misses distributed issues | Complex selectors with workflow orchestration |
| IO failure patterns | Generic IO stress insufficient | FUSE-based filesystem fault injection |
| Clock manipulation | Simple NTP changes affect entire system | Per-container time skew isolation |
The precision philosophy:
Chaos Mesh's design philosophy centers on precision. The tool provides fine-grained control over:
- What is injected: from pod kills down to network, IO, kernel, JVM, and clock faults
- Where it lands: namespace, label, expression, and pod-phase selectors
- How far it spreads: modes that hit all matching pods or only a fixed count or percentage
- When and for how long it runs: explicit durations plus cron-style scheduling
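To make that concrete, here is a minimal sketch of a single experiment exercising those controls; the namespace and labels are placeholders rather than part of any earlier example. The same selector/mode/duration pattern recurs in every chaos type covered below.

```yaml
# Minimal illustrative PodChaos; names and labels are hypothetical
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-failure
  namespace: chaos-mesh
spec:
  action: pod-failure        # what fault to inject
  selector:                  # where: namespace + label targeting
    namespaces:
      - staging
    labelSelectors:
      app: checkout
  mode: fixed-percent        # how far: limit the blast radius
  value: "25"                # affect 25% of the matching pods
  duration: "30s"            # how long the fault stays active
```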
Chaos Mesh reached CNCF Incubating status in 2022, putting it on the same maturity track that projects like Kubernetes, Prometheus, and Envoy once followed. Incubation indicates production adoption, a completed due-diligence review, and a broad contributor community. It's one of only a handful of chaos engineering tools to reach this maturity level.
Chaos Mesh uses a controller-based architecture with privileged components for deep system access. Understanding this architecture—especially the privilege model—is essential for secure deployment.
```text
CHAOS MESH ARCHITECTURE

Chaos Dashboard (UI)
 ├─ Experiment visualization
 ├─ Workflow builder
 ├─ Event timeline
 └─ Token-based authentication
        │
        ▼
Chaos Controller Manager
 ├─ Chaos Daemon client
 └─ One controller per chaos type:
     PodChaos, NetworkChaos, StressChaos, IOChaos,
     TimeChaos, JVMChaos, KernelChaos
        │
        ▼
Chaos Daemon (DaemonSet - runs on every node)
 ├─ Privileged capabilities:
 │   • CAP_SYS_PTRACE: process inspection and injection
 │   • CAP_NET_ADMIN: network namespace manipulation
 │   • CAP_SYS_ADMIN: mount namespace access for IO chaos
 │   • Host PID namespace: process visibility across pods
 │   • Host network: tc/iptables for network chaos
 └─ Injectors: Network, Stress, IO, and Time chaos
        │
        ▼
Target workloads: Pods, StatefulSets, custom applications (JVM, etc.)
```

The Chaos Daemon runs with significant privileges (CAP_SYS_ADMIN, CAP_NET_ADMIN, host namespaces). This is necessary for deep fault injection but requires careful security consideration. Limit Chaos Mesh deployment to dedicated chaos namespaces with strict RBAC, and never deploy it in multi-tenant clusters without isolation controls.
```bash
# Chaos Mesh installation with Helm
# Pay attention to the security-related values

# Basic installation
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Production-ready installation with security controls
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.securityContext.runAsUser=1000 \
  --set dashboard.securityContext.runAsNonRoot=true \
  --set controllerManager.enableFilterNamespace=true \
  --set controllerManager.targetNamespace=default,staging,production \
  --set chaosDaemon.privileged=true \
  --set chaosDaemon.capabilities.add[0]=SYS_PTRACE \
  --set chaosDaemon.capabilities.add[1]=NET_ADMIN \
  --set chaosDaemon.capabilities.add[2]=SYS_ADMIN \
  --set dashboard.env.LISTEN_HOST=127.0.0.1 \
  --set dashboard.service.type=ClusterIP
```

```yaml
# Restrict chaos to specific namespaces via FilterNamespace
# This prevents chaos from affecting critical system namespaces
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-mesh-config
  namespace: chaos-mesh
data:
  # Only these namespaces can be targeted
  filter_namespace: |
    default
    staging
    production
    qa-testing
  # These namespaces are always protected
  protected_namespaces: |
    kube-system
    kube-public
    cert-manager
    istio-system
    monitoring
```

Chaos Mesh provides an exceptionally rich set of chaos types, going far beyond basic pod and network chaos to include JVM, kernel, and time manipulation.
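Before exploring them, a quick post-install sanity check helps. Assuming the chaos-mesh namespace used above, the following commands confirm the control plane and CRDs are in place (pod names may vary slightly by chart version):

```bash
# Control-plane pods: controller manager, dashboard, and one chaos-daemon per node
kubectl get pods -n chaos-mesh

# The chaos CRDs (PodChaos, NetworkChaos, IOChaos, ...) should be registered
kubectl get crds | grep chaos-mesh.org
```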
| Type | Capabilities | Use Cases |
|---|---|---|
| PodChaos | pod-kill, pod-failure, container-kill | Basic pod lifecycle testing |
| NetworkChaos | delay, loss, duplicate, corrupt, partition, bandwidth | Network resilience testing |
| StressChaos | CPU stress, memory stress | Resource exhaustion testing |
| IOChaos | delay, fault, attrOverride | Filesystem and I/O testing |
| TimeChaos | Clock skew (forward/backward) | Time-sensitive logic testing |
| DNSChaos | Error, random responses | Service discovery testing |
| JVMChaos | Exception injection, GC pressure, return value modification | JVM application testing |
| HTTPChaos | Abort, delay, replace, patch | HTTP traffic manipulation |
| KernelChaos | eBPF-based kernel fault injection | Low-level system testing |
| AWSChaos | EC2, EBS chaos | AWS-specific infrastructure testing |
| GCPChaos | VM, disk chaos | GCP-specific testing |
| AzureChaos | VM, disk chaos | Azure-specific testing |
NetworkChaos deep dive:
NetworkChaos is one of Chaos Mesh's most powerful capabilities, offering fine-grained network manipulation:
```yaml
# Example: Complex NetworkChaos with precise targeting
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: targeted-network-delay
  namespace: chaos-mesh
spec:
  # Action: delay, loss, duplicate, corrupt, partition, bandwidth
  action: delay

  # Delay configuration
  delay:
    latency: "100ms"     # Base latency
    jitter: "20ms"       # Random variation ±20ms
    correlation: "25"    # 25% correlation with previous packet
    reorder:
      reorder: "0.05"    # 5% chance of packet reordering
      correlation: "25"
      gap: 5

  # Precise targeting using selectors
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
    expressionSelectors:
      - { key: tier, operator: In, values: [frontend, middleware] }
    podPhaseSelectors:
      - Running

  # Direction and target
  direction: to  # Affect outgoing traffic

  # External targets for egress chaos
  externalTargets:
    - "payment-provider.example.com"
    - "192.168.100.0/24"

  # Target specific destinations within the cluster
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database

  mode: all  # Affect all matching pods

  # Duration and scheduling
  duration: "5m"

  # Scheduler for recurring experiments
  scheduler:
    cron: "0 10 * * 1-5"  # Weekdays at 10 AM
---
# Network partition between two service groups
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: service-partition
  namespace: chaos-mesh
spec:
  action: partition

  # Source: Frontend services
  selector:
    namespaces:
      - production
    labelSelectors:
      tier: frontend

  # Target: Backend services
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        tier: backend

  mode: all

  # Bidirectional partition
  direction: both
  duration: "2m"
```

NetworkChaos allows targeting by namespace, labels, pod phase, specific destinations, ports, and even external IPs. This precision means you can test 'what happens when the payment gateway is slow' without affecting any other network traffic.
JVMChaos is one of Chaos Mesh's most distinctive capabilities. It allows injecting faults directly into running JVM applications without modifying source code—using bytecode manipulation via the Chaosblade-exec-jvm agent.
```yaml
# Example: JVM exception injection
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: payment-service-exception
  namespace: chaos-mesh
spec:
  # Target JVM applications
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service

  # Action type
  action: exception

  # Exception injection configuration
  exception:
    # Class and method to target
    class: com.company.payment.PaymentGatewayClient
    method: processPayment
    # Exception to throw
    exception: java.net.SocketTimeoutException
    # Optional: message for the exception
    message: "Connection to payment provider timed out"

  # JVM agent connection
  port: 9288  # Chaosblade agent port

  duration: "2m"
---
# Example: Method latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: database-query-latency
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service

  action: latency

  latency:
    class: com.company.orders.repository.OrderRepository
    method: findOrdersByCustomerId
    # Add 500ms to every call to this method
    latency: 500

  port: 9288
  duration: "3m"
---
# Example: Return value modification
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: feature-flag-override
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: feature-service

  action: return

  return:
    class: com.company.features.FeatureFlagService
    method: isFeatureEnabled
    # Override return value
    value: "false"  # Disable all feature flags

  port: 9288
  duration: "5m"
---
# Example: GC pressure to test memory handling
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: gc-pressure-test
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: cache-service

  action: gc

  # GC pressure configuration
  gc: {}  # Triggers frequent full GC

  port: 9288
  duration: "5m"
```

JVMChaos requires the Chaosblade JVM agent running in target pods. This is typically done via a sidecar container or by embedding the agent in application startup. Without the agent, JVMChaos experiments will fail to attach.
```yaml
# Deployment with JVM chaos agent sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        # Main application container
        - name: payment-service
          image: company/payment-service:v2.1.0
          ports:
            - containerPort: 8080
          env:
            # Enable JVM agent attachment
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/chaosblade/chaosblade-java-agent.jar"
          volumeMounts:
            - name: chaosblade-agent
              mountPath: /chaosblade

        # Chaosblade agent sidecar
        - name: chaosblade-agent
          image: chaosblade/chaosblade-tool:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Copy agent to shared volume
              cp /opt/chaosblade/* /chaosblade/
              # Start agent server
              /opt/chaosblade/blade server start --port 9288
              # Keep container running
              tail -f /dev/null
          ports:
            - containerPort: 9288
          volumeMounts:
            - name: chaosblade-agent
              mountPath: /chaosblade

      volumes:
        - name: chaosblade-agent
          emptyDir: {}
```

IOChaos and TimeChaos target foundational system behaviors—filesystem operations and system time—that can reveal deeply hidden bugs in application logic.
IOChaos capabilities:
IOChaos uses FUSE (Filesystem in Userspace) to intercept filesystem operations and inject faults:
| Action | Effect | Use Case |
|---|---|---|
| delay | Add latency to IO operations | Test behavior with slow disk |
| fault (EIO) | Return I/O errors | Test corrupted disk handling |
| fault (ENOSPC) | Return no space errors | Test disk full scenarios |
| attrOverride | Modify file attributes | Test stale stat cache handling |
| mistake | Inject bit flips in read/write | Test data integrity handling |
```yaml
# Example: IO latency on database data directory
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-delay
  namespace: chaos-mesh
spec:
  action: latency

  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
      role: primary

  # Mounted volume path to affect
  volumePath: /var/lib/postgresql/data
  # Target specific file patterns
  path: "/var/lib/postgresql/data/**/*.dat"

  # IO latency configuration
  delay: "100ms"

  # Percentage of operations to affect
  percent: 50

  # Methods to affect (read, write, both)
  methods:
    - read
    - write

  duration: "3m"
---
# Example: Disk full simulation
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: log-disk-full
  namespace: chaos-mesh
spec:
  action: fault

  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server

  volumePath: /var/log

  # Return ENOSPC (no space) error
  errno: 28  # ENOSPC

  # Only affect write operations
  methods:
    - write

  # Affect 100% of writes
  percent: 100

  duration: "2m"
```

TimeChaos:
TimeChaos manipulates the system clock as perceived by specific containers. This is invaluable for testing time-sensitive logic:
```yaml
# Example: Time skew for cache expiration testing
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: cache-expiration-test
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: cache-service

  # Time offset (positive = future, negative = past)
  timeOffset: "2h"  # Jump 2 hours into the future

  # Container-specific (only this container sees modified time)
  containerNames:
    - cache-service

  duration: "10m"
---
# Example: Clock skew for distributed systems testing
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-partition
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: distributed-database
    # Only affect a subset of pods
    expressionSelectors:
      - { key: zone, operator: In, values: [us-east-1a] }

  # Significant clock skew to test consensus protocols
  timeOffset: "-30s"  # 30 seconds in the past

  # Which clocks to offset (CLOCK_REALTIME is the default)
  clockIds:
    - CLOCK_REALTIME

  duration: "5m"
```

TimeChaos affects container-level time perception by intercepting clock_gettime; it doesn't change the node's actual time. However, applications that rely on kernel timers or hardware timestamps may not be fully affected by TimeChaos.
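A simple way to confirm the skew from the outside, assuming the cache-expiration-test experiment above is active and the pod is named cache-service-0 (a hypothetical name), is to compare the container's clock with your own:

```bash
# Time as seen inside the affected container (should read roughly 2 hours ahead)
kubectl exec -n staging cache-service-0 -c cache-service -- date -u

# Time outside the experiment, for comparison
date -u
```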
Chaos Mesh provides workflow capabilities for orchestrating complex, multi-step chaos experiments with conditional logic and parallel execution.
```yaml
# Example: Complex chaos workflow
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: e2e-resilience-test
  namespace: chaos-mesh
spec:
  entry: main-workflow
  templates:
    # Entry point: parallel execution
    - name: main-workflow
      templateType: Parallel
      deadline: 30m
      children:
        - network-degradation-path
        - resource-pressure-path

    # Path 1: Network degradation sequence
    - name: network-degradation-path
      templateType: Serial
      children:
        - introduce-latency
        - wait-for-adaptation
        - introduce-partition
        - verify-recovery

    # Step: Add network latency
    - name: introduce-latency
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        delay:
          latency: "200ms"
          jitter: "50ms"
        selector:
          namespaces:
            - production
          labelSelectors:
            tier: frontend
        target:
          selector:
            namespaces:
              - production
            labelSelectors:
              tier: backend
        mode: all
        direction: to
        duration: "3m"

    # Wait step
    - name: wait-for-adaptation
      templateType: Suspend
      deadline: 2m
      suspend:
        duration: "1m"

    # Step: Create network partition
    - name: introduce-partition
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: partition
        selector:
          namespaces:
            - production
          labelSelectors:
            tier: frontend
        target:
          selector:
            namespaces:
              - production
            labelSelectors:
              tier: backend
        mode: all
        direction: both
        duration: "2m"

    # Verification step using StatusCheck
    - name: verify-recovery
      templateType: StatusCheck
      deadline: 5m
      statusCheck:
        type: HTTP
        mode: Continuous
        http:
          url: http://api-gateway.production.svc:8080/health
          criteria:
            statusCode: "200"
        successThreshold: 3
        failureThreshold: 1
        intervalSeconds: 10

    # Path 2: Resource pressure
    - name: resource-pressure-path
      templateType: Serial
      children:
        - cpu-stress
        - memory-stress

    - name: cpu-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        selector:
          namespaces:
            - production
          labelSelectors:
            app: compute-service
        stressors:
          cpu:
            workers: 2
            load: 80
        duration: "3m"
        containerNames:
          - compute-service

    - name: memory-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        selector:
          namespaces:
            - production
          labelSelectors:
            app: compute-service
        stressors:
          memory:
            workers: 2
            size: "512Mi"
        duration: "3m"
        containerNames:
          - compute-service
---
# Schedule for recurring workflows
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-resilience-test
  namespace: chaos-mesh
spec:
  schedule: "0 10 * * 3"       # Wednesdays at 10 AM
  concurrencyPolicy: Forbid    # Don't run if the previous run is still active
  historyLimit: 5
  type: Workflow
  workflow:
    # Reference to workflow template
    workflowSpec:
      entry: main-workflow
      templates:
        # ... (same as above)
```

Chaos Mesh integrates with the cloud-native observability ecosystem, providing visibility into chaos experiments and their impact.
| Integration | Purpose | Configuration |
|---|---|---|
| Prometheus | Chaos metrics scraping | ServiceMonitor CRD |
| Grafana | Chaos dashboards | Pre-built dashboard JSON |
| Kubernetes Events | Chaos event logging | Built-in |
| DataDog | Metrics forwarding | Prometheus remote write |
| ArgoCD | GitOps deployment | Native Kubernetes CRDs |
| GitHub Actions | CI/CD chaos gates | kubectl + Chaos Mesh CRDs |
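The Prometheus and Grafana rows are configured below; the ArgoCD and GitHub Actions rows need no special support because experiments are ordinary Kubernetes manifests. As an illustrative sketch of a CI chaos gate (the manifest path, test script, and KUBECONFIG secret are assumptions, not part of any standard integration), a job can apply a chaos resource, run its tests under failure, and clean up:

```yaml
# Hypothetical GitHub Actions job: apply chaos, run tests under failure, clean up
name: chaos-gate
on: [workflow_dispatch]

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" > ~/.kube/config   # assumed secret

      - name: Inject chaos
        run: kubectl apply -f chaos/podchaos.yaml             # assumed manifest path

      - name: Run resilience tests while chaos is active
        run: ./scripts/resilience-tests.sh                    # assumed test entrypoint

      - name: Remove chaos
        if: always()
        run: kubectl delete -f chaos/podchaos.yaml --ignore-not-found
```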
```yaml
# ServiceMonitor for Chaos Mesh metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-mesh-controller
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: chaos-mesh
  namespaceSelector:
    matchNames:
      - chaos-mesh
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
---
# Grafana dashboard ConfigMap (simplified)
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-mesh-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "true"
data:
  chaos-mesh.json: |
    {
      "title": "Chaos Mesh Overview",
      "panels": [
        {
          "title": "Active Experiments",
          "targets": [
            { "expr": "chaos_mesh_experiments{status='running'}" }
          ]
        },
        {
          "title": "Experiment Success Rate",
          "targets": [
            { "expr": "sum(chaos_mesh_experiments{status='succeeded'}) / sum(chaos_mesh_experiments) * 100" }
          ]
        },
        {
          "title": "Experiments by Type",
          "targets": [
            { "expr": "sum by (kind) (chaos_mesh_experiments)" }
          ]
        }
      ]
    }
---
# Alert rules for chaos events
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-mesh-alerts
  namespace: monitoring
spec:
  groups:
    - name: chaos-mesh
      rules:
        - alert: ChaosExperimentRunning
          expr: chaos_mesh_experiments{status='running'} > 0
          for: 0m
          labels:
            severity: info
          annotations:
            summary: "Active chaos experiment detected"
            description: "{{ $labels.name }} in {{ $labels.namespace }} is running"

        - alert: ChaosExperimentFailed
          expr: increase(chaos_mesh_experiments{status='failed'}[5m]) > 0
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Chaos experiment failed"
            description: "Experiment {{ $labels.name }} failed"

        - alert: UnexpectedChaosOutsideWindow
          expr: |
            chaos_mesh_experiments{status='running'} > 0
            and on() hour() < 9 or hour() > 17
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Chaos running outside business hours"
            description: "Chaos experiments should only run 9-17. Investigate immediately."
```

Configure Grafana to overlay chaos events on application dashboards. When reviewing performance issues, seeing 'Chaos started here' annotations immediately identifies chaos-induced anomalies versus genuine production issues.
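One way to wire up those overlays, sketched below, is an annotation query added to the application dashboard's JSON. The query reuses the chaos_mesh_experiments metric from the dashboard ConfigMap above, which is an assumption about how your metrics are named rather than a guaranteed Chaos Mesh export, and annotation field names vary across Grafana versions:

```json
{
  "annotations": {
    "list": [
      {
        "name": "Chaos experiments",
        "datasource": "Prometheus",
        "enable": true,
        "iconColor": "red",
        "expr": "changes(chaos_mesh_experiments{status='running'}[1m]) > 0",
        "titleFormat": "Chaos started here"
      }
    ]
  }
}
```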
Chaos Mesh brings precision and depth to Kubernetes chaos engineering, offering capabilities that go beyond basic pod and network disruption to include JVM-level, kernel-level, and time-based fault injection.
When to choose Chaos Mesh:
- You need precision: fine-grained selectors, exact timing, and a tightly scoped blast radius.
- You run JVM workloads and want application-level fault injection without code changes.
- You need low-level faults (filesystem IO, kernel paths, clock skew) that simpler tools can't reach.
- You want to orchestrate multi-step scenarios with workflows, status checks, and recurring schedules.
- You can meet its privilege requirements safely: dedicated namespaces, strict RBAC, and no uncontrolled multi-tenant deployment.
What's next:
We've explored open-source and Kubernetes-native chaos tools. But what about organizations deeply invested in AWS? In the next page, we'll explore AWS Fault Injection Simulator, Amazon's native chaos engineering service designed specifically for AWS infrastructure.
You now understand Chaos Mesh's precision-focused architecture, its rich set of chaos types including JVM and kernel-level injection, workflow capabilities, and integration with the cloud-native observability stack. Chaos Mesh represents the state of the art in fine-grained Kubernetes chaos engineering.