As Kubernetes deployments grow in complexity—with service meshes, operators, custom controllers, and multi-tenant workloads—the chaos engineering requirements become equally sophisticated. You need more than pod deletion; you need the ability to inject precise failures at specific layers of the stack, at exact moments, with surgical targeting.
Chaos Mesh was built for exactly this level of precision.
Developed by PingCAP (the company behind TiDB, a distributed database) and now a CNCF incubating project, Chaos Mesh represents the next evolution of Kubernetes-native chaos. Born from the need to test a complex distributed database system, it brings fine-grained chaos capabilities that go beyond what simpler tools can achieve.
By the end of this page, you will understand Chaos Mesh's architecture and privilege model, master its unique chaos types including JVM chaos and kernel-level injection, learn to design precision chaos experiments with complex selectors, and integrate Chaos Mesh into production workflows with appropriate safety controls.
Chaos Mesh emerged from PingCAP's need to test TiDB, a distributed NewSQL database. TiDB's architecture includes multiple components—TiDB servers, TiKV storage nodes, PD (Placement Driver) for cluster management—each with complex failure modes that simple chaos tools couldn't adequately test.
| Requirement | Why Traditional Tools Fell Short | Chaos Mesh Solution |
|---|---|---|
| Time-sensitive failures | Random timing insufficient for race conditions | Precise scheduling with time-based injection |
| Kernel-level injection | User-space injection doesn't test kernel paths | eBPF-based kernel chaos capabilities |
| JVM-specific failures | Generic process disruption too coarse | Bytecode injection for JVM applications |
| Multi-component scenarios | Single-target focus misses distributed issues | Complex selectors with workflow orchestration |
| IO failure patterns | Generic IO stress insufficient | FUSE-based filesystem fault injection |
| Clock manipulation | Simple NTP changes affect entire system | Per-container time skew isolation |
The precision philosophy:
Chaos Mesh's design philosophy centers on precision. The tool provides fine-grained control over:
- What is injected: from pod kills down to network, IO, kernel, JVM, and clock faults
- Where it lands: namespace, label, expression, and pod-phase selectors
- How far it spreads: modes that hit all matching pods or only a fixed count or percentage
- When and for how long it runs: explicit durations plus cron-style scheduling
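To make that concrete, here is a minimal sketch of a single experiment exercising those controls; the namespace and labels are placeholders rather than part of any earlier example. The same selector/mode/duration pattern recurs in every chaos type covered below.

```yaml
# Minimal illustrative PodChaos; names and labels are hypothetical
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-failure
  namespace: chaos-mesh
spec:
  action: pod-failure        # what fault to inject
  selector:                  # where: namespace + label targeting
    namespaces:
      - staging
    labelSelectors:
      app: checkout
  mode: fixed-percent        # how far: limit the blast radius
  value: "25"                # affect 25% of the matching pods
  duration: "30s"            # how long the fault stays active
```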
Chaos Mesh reached CNCF Incubating status in 2022, putting it on the same maturity track that projects like Kubernetes, Prometheus, and Envoy once followed. Incubation indicates production adoption, a completed due-diligence review, and a broad contributor community. It's one of only a handful of chaos engineering tools to reach this maturity level.
Chaos Mesh uses a controller-based architecture with privileged components for deep system access. Understanding this architecture—especially the privilege model—is essential for secure deployment.
```text
CHAOS MESH ARCHITECTURE

Chaos Dashboard (UI)
 ├─ Experiment visualization
 ├─ Workflow builder
 ├─ Event timeline
 └─ Token-based authentication
        │
        ▼
Chaos Controller Manager
 ├─ Chaos Daemon client
 └─ One controller per chaos type:
     PodChaos, NetworkChaos, StressChaos, IOChaos,
     TimeChaos, JVMChaos, KernelChaos
        │
        ▼
Chaos Daemon (DaemonSet - runs on every node)
 ├─ Privileged capabilities:
 │   • CAP_SYS_PTRACE: process inspection and injection
 │   • CAP_NET_ADMIN: network namespace manipulation
 │   • CAP_SYS_ADMIN: mount namespace access for IO chaos
 │   • Host PID namespace: process visibility across pods
 │   • Host network: tc/iptables for network chaos
 └─ Injectors: Network, Stress, IO, and Time chaos
        │
        ▼
Target workloads: Pods, StatefulSets, custom applications (JVM, etc.)
```

The Chaos Daemon runs with significant privileges (CAP_SYS_ADMIN, CAP_NET_ADMIN, host namespaces). This is necessary for deep fault injection but requires careful security consideration. Limit Chaos Mesh deployment to dedicated chaos namespaces with strict RBAC, and never deploy it in multi-tenant clusters without isolation controls.
```bash
# Chaos Mesh installation with Helm
# Pay attention to the security-related values

# Basic installation
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Production-ready installation with security controls
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set dashboard.securityContext.runAsUser=1000 \
  --set dashboard.securityContext.runAsNonRoot=true \
  --set controllerManager.enableFilterNamespace=true \
  --set controllerManager.targetNamespace=default,staging,production \
  --set chaosDaemon.privileged=true \
  --set chaosDaemon.capabilities.add[0]=SYS_PTRACE \
  --set chaosDaemon.capabilities.add[1]=NET_ADMIN \
  --set chaosDaemon.capabilities.add[2]=SYS_ADMIN \
  --set dashboard.env.LISTEN_HOST=127.0.0.1 \
  --set dashboard.service.type=ClusterIP
```

```yaml
# Restrict chaos to specific namespaces via FilterNamespace
# This prevents chaos from affecting critical system namespaces
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-mesh-config
  namespace: chaos-mesh
data:
  # Only these namespaces can be targeted
  filter_namespace: |
    default
    staging
    production
    qa-testing
  # These namespaces are always protected
  protected_namespaces: |
    kube-system
    kube-public
    cert-manager
    istio-system
    monitoring
```

Chaos Mesh provides an exceptionally rich set of chaos types, going far beyond basic pod and network chaos to include JVM, kernel, and time manipulation.
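Before exploring them, a quick post-install sanity check helps. Assuming the chaos-mesh namespace used above, the following commands confirm the control plane and CRDs are in place (pod names may vary slightly by chart version):

```bash
# Control-plane pods: controller manager, dashboard, and one chaos-daemon per node
kubectl get pods -n chaos-mesh

# The chaos CRDs (PodChaos, NetworkChaos, IOChaos, ...) should be registered
kubectl get crds | grep chaos-mesh.org
```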
| Type | Capabilities | Use Cases |
|---|---|---|
| PodChaos | pod-kill, pod-failure, container-kill | Basic pod lifecycle testing |
| NetworkChaos | delay, loss, duplicate, corrupt, partition, bandwidth | Network resilience testing |
| StressChaos | CPU stress, memory stress | Resource exhaustion testing |
| IOChaos | delay, fault, attrOverride | Filesystem and I/O testing |
| TimeChaos | Clock skew (forward/backward) | Time-sensitive logic testing |
| DNSChaos | Error, random responses | Service discovery testing |
| JVMChaos | Exception injection, GC pressure, return value modification | JVM application testing |
| HTTPChaos | Abort, delay, replace, patch | HTTP traffic manipulation |
| KernelChaos | eBPF-based kernel fault injection | Low-level system testing |
| AWSChaos | EC2, EBS chaos | AWS-specific infrastructure testing |
| GCPChaos | VM, disk chaos | GCP-specific testing |
| AzureChaos | VM, disk chaos | Azure-specific testing |
NetworkChaos deep dive:
NetworkChaos is one of Chaos Mesh's most powerful capabilities, offering fine-grained network manipulation:
```yaml
# Example: Complex NetworkChaos with precise targeting
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: targeted-network-delay
  namespace: chaos-mesh
spec:
  # Action: delay, loss, duplicate, corrupt, partition, bandwidth
  action: delay

  # Delay configuration
  delay:
    latency: "100ms"     # Base latency
    jitter: "20ms"       # Random variation ±20ms
    correlation: "25"    # 25% correlation with previous packet
    reorder:
      reorder: "0.05"    # 5% chance of packet reordering
      correlation: "25"
      gap: 5

  # Precise targeting using selectors
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
    expressionSelectors:
      - { key: tier, operator: In, values: [frontend, middleware] }
    podPhaseSelectors:
      - Running

  # Direction and target
  direction: to  # Affect outgoing traffic

  # External targets for egress chaos
  externalTargets:
    - "payment-provider.example.com"
    - "192.168.100.0/24"

  # Target specific destinations within the cluster
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database

  mode: all  # Affect all matching pods

  # Duration and scheduling
  duration: "5m"

  # Scheduler for recurring experiments
  scheduler:
    cron: "0 10 * * 1-5"  # Weekdays at 10 AM
---
# Network partition between two service groups
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: service-partition
  namespace: chaos-mesh
spec:
  action: partition

  # Source: Frontend services
  selector:
    namespaces:
      - production
    labelSelectors:
      tier: frontend

  # Target: Backend services
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        tier: backend

  mode: all

  # Bidirectional partition
  direction: both
  duration: "2m"
```

NetworkChaos allows targeting by namespace, labels, pod phase, specific destinations, ports, and even external IPs. This precision means you can test 'what happens when the payment gateway is slow' without affecting any other network traffic.
JVMChaos is one of Chaos Mesh's most distinctive capabilities. It allows injecting faults directly into running JVM applications without modifying source code—using bytecode manipulation via the Chaosblade-exec-jvm agent.
```yaml
# Example: JVM exception injection
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: payment-service-exception
  namespace: chaos-mesh
spec:
  # Target JVM applications
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service

  # Action type
  action: exception

  # Exception injection configuration
  exception:
    # Class and method to target
    class: com.company.payment.PaymentGatewayClient
    method: processPayment
    # Exception to throw
    exception: java.net.SocketTimeoutException
    # Optional: message for the exception
    message: "Connection to payment provider timed out"

  # JVM agent connection
  port: 9288  # Chaosblade agent port

  duration: "2m"
---
# Example: Method latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: database-query-latency
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service

  action: latency

  latency:
    class: com.company.orders.repository.OrderRepository
    method: findOrdersByCustomerId
    # Add 500ms to every call to this method
    latency: 500

  port: 9288
  duration: "3m"
---
# Example: Return value modification
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: feature-flag-override
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: feature-service

  action: return

  return:
    class: com.company.features.FeatureFlagService
    method: isFeatureEnabled
    # Override return value
    value: "false"  # Disable all feature flags

  port: 9288
  duration: "5m"
---
# Example: GC pressure to test memory handling
apiVersion: chaos-mesh.org/v1alpha1
kind: JVMChaos
metadata:
  name: gc-pressure-test
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: cache-service

  action: gc

  # GC pressure configuration
  gc: {}  # Triggers frequent full GC

  port: 9288
  duration: "5m"
```

JVMChaos requires the Chaosblade JVM agent running in target pods. This is typically done via a sidecar container or by embedding the agent in application startup. Without the agent, JVMChaos experiments will fail to attach.
```yaml
# Deployment with JVM chaos agent sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        # Main application container
        - name: payment-service
          image: company/payment-service:v2.1.0
          ports:
            - containerPort: 8080
          env:
            # Enable JVM agent attachment
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/chaosblade/chaosblade-java-agent.jar"
          volumeMounts:
            - name: chaosblade-agent
              mountPath: /chaosblade

        # Chaosblade agent sidecar
        - name: chaosblade-agent
          image: chaosblade/chaosblade-tool:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Copy agent to shared volume
              cp /opt/chaosblade/* /chaosblade/
              # Start agent server
              /opt/chaosblade/blade server start --port 9288
              # Keep container running
              tail -f /dev/null
          ports:
            - containerPort: 9288
          volumeMounts:
            - name: chaosblade-agent
              mountPath: /chaosblade

      volumes:
        - name: chaosblade-agent
          emptyDir: {}
```

IOChaos and TimeChaos target foundational system behaviors—filesystem operations and system time—that can reveal deeply hidden bugs in application logic.
IOChaos capabilities:
IOChaos uses FUSE (Filesystem in Userspace) to intercept filesystem operations and inject faults:
| Action | Effect | Use Case |
|---|---|---|
| delay | Add latency to IO operations | Test behavior with slow disk |
| fault (EIO) | Return I/O errors | Test corrupted disk handling |
| fault (ENOSPC) | Return no space errors | Test disk full scenarios |
| attrOverride | Modify file attributes | Test stale stat cache handling |
| mistake | Inject bit flips in read/write | Test data integrity handling |
```yaml
# Example: IO latency on database data directory
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: postgres-io-delay
  namespace: chaos-mesh
spec:
  action: latency

  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgres
      role: primary

  # Mounted volume path to affect
  volumePath: /var/lib/postgresql/data
  # Target specific file patterns
  path: "/var/lib/postgresql/data/**/*.dat"

  # IO latency configuration
  delay: "100ms"

  # Percentage of operations to affect
  percent: 50

  # Methods to affect (read, write, both)
  methods:
    - read
    - write

  duration: "3m"
---
# Example: Disk full simulation
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: log-disk-full
  namespace: chaos-mesh
spec:
  action: fault

  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server

  volumePath: /var/log

  # Return ENOSPC (no space) error
  errno: 28  # ENOSPC

  # Only affect write operations
  methods:
    - write

  # Affect 100% of writes
  percent: 100

  duration: "2m"
```

TimeChaos:
TimeChaos manipulates the system clock as perceived by specific containers. This is invaluable for testing time-sensitive logic:
```yaml
# Example: Time skew for cache expiration testing
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: cache-expiration-test
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: cache-service

  # Time offset (positive = future, negative = past)
  timeOffset: "2h"  # Jump 2 hours into the future

  # Container-specific (only this container sees modified time)
  containerNames:
    - cache-service

  duration: "10m"
---
# Example: Clock skew for distributed systems testing
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-partition
  namespace: chaos-mesh
spec:
  selector:
    namespaces:
      - production
    labelSelectors:
      app: distributed-database
    # Only affect a subset of pods
    expressionSelectors:
      - { key: zone, operator: In, values: [us-east-1a] }

  # Significant clock skew to test consensus protocols
  timeOffset: "-30s"  # 30 seconds in the past

  # Which clocks to offset (CLOCK_REALTIME is the default)
  clockIds:
    - CLOCK_REALTIME

  duration: "5m"
```

TimeChaos affects container-level time perception by intercepting clock_gettime; it doesn't change the node's actual time. However, applications that rely on kernel timers or hardware timestamps may not be fully affected by TimeChaos.
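A simple way to confirm the skew from the outside, assuming the cache-expiration-test experiment above is active and the pod is named cache-service-0 (a hypothetical name), is to compare the container's clock with your own:

```bash
# Time as seen inside the affected container (should read roughly 2 hours ahead)
kubectl exec -n staging cache-service-0 -c cache-service -- date -u

# Time outside the experiment, for comparison
date -u
```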
Chaos Mesh provides workflow capabilities for orchestrating complex, multi-step chaos experiments with conditional logic and parallel execution.
```yaml
# Example: Complex chaos workflow
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: e2e-resilience-test
  namespace: chaos-mesh
spec:
  entry: main-workflow
  templates:
    # Entry point: parallel execution
    - name: main-workflow
      templateType: Parallel
      deadline: 30m
      children:
        - network-degradation-path
        - resource-pressure-path

    # Path 1: Network degradation sequence
    - name: network-degradation-path
      templateType: Serial
      children:
        - introduce-latency
        - wait-for-adaptation
        - introduce-partition
        - verify-recovery

    # Step: Add network latency
    - name: introduce-latency
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        delay:
          latency: "200ms"
          jitter: "50ms"
        selector:
          namespaces:
            - production
          labelSelectors:
            tier: frontend
        target:
          selector:
            namespaces:
              - production
            labelSelectors:
              tier: backend
        mode: all
        direction: to
        duration: "3m"

    # Wait step
    - name: wait-for-adaptation
      templateType: Suspend
      deadline: 2m
      suspend:
        duration: "1m"

    # Step: Create network partition
    - name: introduce-partition
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: partition
        selector:
          namespaces:
            - production
          labelSelectors:
            tier: frontend
        target:
          selector:
            namespaces:
              - production
            labelSelectors:
              tier: backend
        mode: all
        direction: both
        duration: "2m"

    # Verification step using StatusCheck
    - name: verify-recovery
      templateType: StatusCheck
      deadline: 5m
      statusCheck:
        type: HTTP
        mode: Continuous
        http:
          url: http://api-gateway.production.svc:8080/health
          criteria:
            statusCode: "200"
        successThreshold: 3
        failureThreshold: 1
        intervalSeconds: 10

    # Path 2: Resource pressure
    - name: resource-pressure-path
      templateType: Serial
      children:
        - cpu-stress
        - memory-stress

    - name: cpu-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        selector:
          namespaces:
            - production
          labelSelectors:
            app: compute-service
        stressors:
          cpu:
            workers: 2
            load: 80
        duration: "3m"
        containerNames:
          - compute-service

    - name: memory-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        selector:
          namespaces:
            - production
          labelSelectors:
            app: compute-service
        stressors:
          memory:
            workers: 2
            size: "512Mi"
        duration: "3m"
        containerNames:
          - compute-service
---
# Schedule for recurring workflows
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-resilience-test
  namespace: chaos-mesh
spec:
  schedule: "0 10 * * 3"       # Wednesdays at 10 AM
  concurrencyPolicy: Forbid    # Don't run if the previous run is still active
  historyLimit: 5
  type: Workflow
  workflow:
    # Reference to workflow template
    workflowSpec:
      entry: main-workflow
      templates:
        # ... (same as above)
```

Chaos Mesh integrates with the cloud-native observability ecosystem, providing visibility into chaos experiments and their impact.
| Integration | Purpose | Configuration |
|---|---|---|
| Prometheus | Chaos metrics scraping | ServiceMonitor CRD |
| Grafana | Chaos dashboards | Pre-built dashboard JSON |
| Kubernetes Events | Chaos event logging | Built-in |
| DataDog | Metrics forwarding | Prometheus remote write |
| ArgoCD | GitOps deployment | Native Kubernetes CRDs |
| GitHub Actions | CI/CD chaos gates | kubectl + Chaos Mesh CRDs |
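The Prometheus and Grafana rows are configured below; the ArgoCD and GitHub Actions rows need no special support because experiments are ordinary Kubernetes manifests. As an illustrative sketch of a CI chaos gate (the manifest path, test script, and KUBECONFIG secret are assumptions, not part of any standard integration), a job can apply a chaos resource, run its tests under failure, and clean up:

```yaml
# Hypothetical GitHub Actions job: apply chaos, run tests under failure, clean up
name: chaos-gate
on: [workflow_dispatch]

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" > ~/.kube/config   # assumed secret

      - name: Inject chaos
        run: kubectl apply -f chaos/podchaos.yaml             # assumed manifest path

      - name: Run resilience tests while chaos is active
        run: ./scripts/resilience-tests.sh                    # assumed test entrypoint

      - name: Remove chaos
        if: always()
        run: kubectl delete -f chaos/podchaos.yaml --ignore-not-found
```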
```yaml
# ServiceMonitor for Chaos Mesh metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-mesh-controller
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: chaos-mesh
  namespaceSelector:
    matchNames:
      - chaos-mesh
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
---
# Grafana dashboard ConfigMap (simplified)
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaos-mesh-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "true"
data:
  chaos-mesh.json: |
    {
      "title": "Chaos Mesh Overview",
      "panels": [
        {
          "title": "Active Experiments",
          "targets": [
            { "expr": "chaos_mesh_experiments{status='running'}" }
          ]
        },
        {
          "title": "Experiment Success Rate",
          "targets": [
            { "expr": "sum(chaos_mesh_experiments{status='succeeded'}) / sum(chaos_mesh_experiments) * 100" }
          ]
        },
        {
          "title": "Experiments by Type",
          "targets": [
            { "expr": "sum by (kind) (chaos_mesh_experiments)" }
          ]
        }
      ]
    }
---
# Alert rules for chaos events
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-mesh-alerts
  namespace: monitoring
spec:
  groups:
    - name: chaos-mesh
      rules:
        - alert: ChaosExperimentRunning
          expr: chaos_mesh_experiments{status='running'} > 0
          for: 0m
          labels:
            severity: info
          annotations:
            summary: "Active chaos experiment detected"
            description: "{{ $labels.name }} in {{ $labels.namespace }} is running"

        - alert: ChaosExperimentFailed
          expr: increase(chaos_mesh_experiments{status='failed'}[5m]) > 0
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Chaos experiment failed"
            description: "Experiment {{ $labels.name }} failed"

        - alert: UnexpectedChaosOutsideWindow
          expr: |
            chaos_mesh_experiments{status='running'} > 0
            and on() hour() < 9 or hour() > 17
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Chaos running outside business hours"
            description: "Chaos experiments should only run 9-17. Investigate immediately."
```

Configure Grafana to overlay chaos events on application dashboards. When reviewing performance issues, seeing 'Chaos started here' annotations immediately identifies chaos-induced anomalies versus genuine production issues.
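One way to wire up those overlays, sketched below, is an annotation query added to the application dashboard's JSON. The query reuses the chaos_mesh_experiments metric from the dashboard ConfigMap above, which is an assumption about how your metrics are named rather than a guaranteed Chaos Mesh export, and annotation field names vary across Grafana versions:

```json
{
  "annotations": {
    "list": [
      {
        "name": "Chaos experiments",
        "datasource": "Prometheus",
        "enable": true,
        "iconColor": "red",
        "expr": "changes(chaos_mesh_experiments{status='running'}[1m]) > 0",
        "titleFormat": "Chaos started here"
      }
    ]
  }
}
```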
Chaos Mesh brings precision and depth to Kubernetes chaos engineering, offering capabilities that go beyond basic pod and network disruption to include JVM-level, kernel-level, and time-based fault injection.
When to choose Chaos Mesh:
- You need precision: fine-grained selectors, exact timing, and a tightly scoped blast radius.
- You run JVM workloads and want application-level fault injection without code changes.
- You need low-level faults (filesystem IO, kernel paths, clock skew) that simpler tools can't reach.
- You want to orchestrate multi-step scenarios with workflows, status checks, and recurring schedules.
- You can meet its privilege requirements safely: dedicated namespaces, strict RBAC, and no uncontrolled multi-tenant deployment.
What's next:
We've explored open-source and Kubernetes-native chaos tools. But what about organizations deeply invested in AWS? In the next page, we'll explore AWS Fault Injection Simulator, Amazon's native chaos engineering service designed specifically for AWS infrastructure.
You now understand Chaos Mesh's precision-focused architecture, its rich set of chaos types including JVM and kernel-level injection, workflow capabilities, and integration with the cloud-native observability stack. Chaos Mesh represents the state of the art in fine-grained Kubernetes chaos engineering.