Chaos Tools - Learning Module

AWS Fault Injection Simulator: Cloud-Native Chaos

Chaos Engineering from the Cloud Provider

When you run chaos engineering on AWS, you are usually simulating things AWS itself can do: terminate instances, disrupt network connectivity, exhaust resources. Third-party tools can only approximate these behaviors from outside the platform. What if you could inject faults through AWS's own control plane instead?

AWS Fault Injection Simulator (FIS) provides exactly this capability.

Launched in 2021 and since renamed AWS Fault Injection Service, FIS is Amazon's fully managed chaos engineering service. It integrates natively with AWS services (EC2, ECS, EKS, RDS, Lambda, and more) and offers some fault types that are hard or impossible to reproduce from outside AWS. When FIS interrupts a Spot Instance, the instance receives the same interruption notice EC2 sends for a real reclaim. When it disrupts AZ connectivity, it uses AWS networking primitives rather than agents on your hosts.

What You Will Learn

By the end of this page, you will understand the FIS architecture and service integration model, know the experiment template format and the action types available across AWS services, be able to design safe experiments with stop conditions and IAM guardrails, and be able to integrate FIS into CI/CD pipelines and operational runbooks.

Why Cloud-Native Chaos Matters

Third-party chaos tools have limitations when targeting cloud infrastructure. They operate at the OS level or through APIs, which means they can't access cloud-provider internals. AWS FIS eliminates these constraints.

Third-Party vs. AWS-Native Chaos
Capability	Third-Party Tools	AWS FIS
Instance termination	API call (same as user action)	Same EC2 API, called by FIS under the experiment's IAM role with stop conditions applied
AZ failure simulation	Block network traffic manually	Native AZ isolation primitives
RDS failover	Cannot trigger internal failover	Trigger actual Multi-AZ failover
Lambda throttling	Cannot affect control plane	Apply service-level throttles
Systems Manager integration	Requires agent configuration	Native SSM integration with pre-built AWSFIS documents (SSM Agent still required on instances)
IAM-based guardrails	Must implement separately	Built-in IAM policy controls

The authorization advantage:

FIS uses IAM for all authorization, which means:

•Least privilege: An experiment can do only what its IAM role allows
•Audit trail: CloudTrail records FIS API calls for compliance
•Organizational controls: Service Control Policies can restrict FIS usage
•Cross-account support: Experiments can target other AWS accounts with the proper IAM setup

First-Party Integration Benefits

Because FIS is an AWS service, it has access to internal APIs and mechanisms that external tools cannot reach. When testing AWS service behavior, FIS provides the most authentic failure simulation possible—it's testing AWS with AWS.

FIS Architecture

FIS follows a template-based architecture where experiments are defined as templates containing actions, targets, and stop conditions. Understanding these components is essential for designing effective experiments.

FIS Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                      AWS FIS ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │                    EXPERIMENT TEMPLATE                          │    │
│  │                                                                  │    │
│  │  ┌─────────────────────────────────────────────────────────┐   │    │
│  │  │                    ACTIONS                               │   │    │
│  │  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │   │    │
│  │  │  │ aws:ec2:     │ │ aws:rds:     │ │ aws:ssm:send-   │ │   │    │
│  │  │  │ terminate-   │ │ failover-    │ │ command         │ │   │    │
│  │  │  │ instances    │ │ db-cluster   │ │                 │ │   │    │
│  │  │  └──────────────┘ └──────────────┘ └──────────────────┘ │   │    │
│  │  └─────────────────────────────────────────────────────────┘   │    │
│  │                                                                  │    │
│  │  ┌─────────────────────────────────────────────────────────┐   │    │
│  │  │                    TARGETS                               │   │    │
│  │  │  ┌──────────────────────┐ ┌────────────────────────────┐│   │    │
│  │  │  │  Resource ARNs       │ │  Resource Tags             ││   │    │
│  │  │  │  (explicit list)     │ │  (dynamic selection)       ││   │    │
│  │  │  └──────────────────────┘ └────────────────────────────┘│   │    │
│  │  │  ┌──────────────────────┐ ┌────────────────────────────┐│   │    │
│  │  │  │  Resource Filters    │ │  Selection Mode            ││   │    │
│  │  │  │  (path-based)        │ │  (ALL, COUNT, PERCENT)     ││   │    │
│  │  │  └──────────────────────┘ └────────────────────────────┘│   │    │
│  │  └─────────────────────────────────────────────────────────┘   │    │
│  │                                                                  │    │
│  │  ┌─────────────────────────────────────────────────────────┐   │    │
│  │  │                 STOP CONDITIONS                          │   │    │
│  │  │  ┌──────────────────────────────────────────────────┐   │   │    │
│  │  │  │  CloudWatch Alarms (automatic halt on breach)    │   │   │    │
│  │  │  └──────────────────────────────────────────────────┘   │   │    │
│  │  └─────────────────────────────────────────────────────────┘   │    │
│  └───────────────────────────────────┬────────────────────────────┘    │
│                                      │                                  │
│  ════════════════════════════════════════════════════════════════════  │
│                         EXECUTION                                       │
│  ════════════════════════════════════════════════════════════════════  │
│                                      │                                  │
│  ┌───────────────────────────────────▼───────────────────────────────┐ │
│  │                     FIS SERVICE                                    │ │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────────┐│ │
│  │  │  Experiment     │  │  Action         │  │  IAM Role          ││ │
│  │  │  Orchestrator   │  │  Executor       │  │  Assumption        ││ │
│  │  └─────────────────┘  └─────────────────┘  └────────────────────┘│ │
│  └───────────────────────────────────┬───────────────────────────────┘ │
│                                      │                                  │
│  ┌───────────────────────────────────▼───────────────────────────────┐ │
│  │                    TARGET AWS SERVICES                             │ │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────────┐│ │
│  │  │  EC2    │ │  RDS    │ │  ECS    │ │  EKS    │ │  Lambda      ││ │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └──────────────┘│ │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌──────────────┐│ │
│  │  │  SSM    │ │  EBS    │ │  VPC    │ │  S3     │ │  DynamoDB    ││ │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └──────────────┘│ │
│  └───────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘

Core Components

•Experiment Template: Reusable definition of an experiment, including actions, targets, and stop conditions. Templates are plain JSON (or CloudFormation resources), so they can be kept in version control and shared.
•Actions: Specific fault injection operations (terminate instance, add latency, throttle API). Each action has parameters for customization.
•Targets: Resources to affect, chosen by explicit ARNs, tags, or filters. The selection mode controls how many of the matching resources are affected.
•Stop Conditions: CloudWatch alarms that halt the experiment when they go into ALARM. An essential safety mechanism for limiting impact.
•IAM Role: The role FIS assumes to execute actions. It controls exactly what FIS can do: if the role lacks a permission an action needs, that action fails.
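To make the relationship between these components concrete, here is a minimal Python sketch that models a template as a plain dictionary and checks that every action refers to a target defined in the same template. The helper name `undefined_target_refs` and the sample values are illustrative assumptions, not part of any AWS SDK; the field names follow the JSON template format used on this page.

```python
def undefined_target_refs(template: dict) -> list:
    """Return 'action -> target' strings for targets an action names
    but the template does not define."""
    defined = set(template.get("targets", {}))
    missing = []
    for action_name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined:
                missing.append(f"{action_name} -> {target_name}")
    return missing


# Hypothetical template: one action, one target, one stop condition.
template = {
    "description": "Stop a quarter of the web tier",
    "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
    "actions": {
        "StopWeb": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "WebServers"},
        }
    },
    "targets": {
        "WebServers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Application": "web-frontend"},
            "selectionMode": "PERCENT(25)",
        }
    },
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
        }
    ],
}

print(undefined_target_refs(template))  # [] because WebServers is defined
```

A check like this is cheap to run in code review or CI before a template is ever submitted to FIS.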

IAM as Guardrail

FIS's IAM integration is a critical safety feature. Even if an experiment template specifies an action, FIS can only execute it if the experiment's IAM role has the necessary permissions. This allows security teams to control chaos scope through IAM policies rather than trusting experiment definitions alone.
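The guardrail idea can be sketched as a simple pre-flight comparison between what a template wants to do and what the role allows. The mapping below is a small, hand-written subset chosen for illustration (the action reference in the FIS documentation is the authoritative list), and `blocked_actions` is a hypothetical helper. Real IAM evaluation also considers conditions, resource scoping, and explicit denies, which this sketch ignores.

```python
# Illustrative subset only: IAM actions the experiment role needs per FIS action.
REQUIRED_IAM_ACTIONS = {
    "aws:ec2:terminate-instances": {"ec2:TerminateInstances"},
    "aws:ec2:stop-instances": {"ec2:StopInstances", "ec2:StartInstances"},
    "aws:ec2:reboot-instances": {"ec2:RebootInstances"},
    "aws:rds:failover-db-cluster": {"rds:FailoverDBCluster"},
}


def blocked_actions(template_action_ids, role_allowed_actions):
    """Return the FIS action IDs the role could not execute.

    Unknown action IDs are reported as blocked so that gaps in the
    mapping fail closed rather than open.
    """
    allowed = set(role_allowed_actions)
    blocked = []
    for action_id in template_action_ids:
        required = REQUIRED_IAM_ACTIONS.get(action_id)
        if required is None or not required <= allowed:
            blocked.append(action_id)
    return blocked


role_permissions = ["ec2:StopInstances", "ec2:StartInstances"]
wanted = ["aws:ec2:stop-instances", "aws:rds:failover-db-cluster"]
print(blocked_actions(wanted, role_permissions))
# ['aws:rds:failover-db-cluster']
```

The point of the design is that the template author and the role owner can be different people: the template may ask for anything, but only the role decides what actually runs.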

FIS Action Catalog

FIS provides an extensive catalog of actions targeting AWS services. Understanding these actions enables you to design experiments that test the specific failure modes your applications must survive.

EC2 Actions
Action	Description	Use Case
aws:ec2:terminate-instances	Terminate selected instances	Instance failure resilience
aws:ec2:stop-instances	Stop (not terminate) instances	Planned maintenance simulation
aws:ec2:reboot-instances	Reboot instances	Reboot recovery testing
aws:ec2:send-spot-instance-interruptions	Simulate spot interruption warning	Spot instance handling

Network Actions
Action	Description	Use Case
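aws:network:disrupt-connectivity	Deny traffic to and from target subnets (scope: all, availability-zone, vpc, s3, dynamodb, prefix-list)	Multi-AZ failover testing
aws:network:route-table-disrupt-cross-region-connectivity	Block traffic from target subnets to another Region	Cross-Region isolation testing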
aws:ssm:send-command (network scripts)	Inject latency via SSM	Network performance degradation

Database and Storage Actions
Action	Description	Use Case
aws:rds:failover-db-cluster	Trigger Aurora cluster failover	Database failover testing
aws:rds:reboot-db-instances	Reboot RDS instances	Database restart recovery
aws:ebs:pause-volume-io	Pause EBS volume I/O	Storage failure testing

Container and Serverless Actions
Action	Description	Use Case
aws:ecs:drain-container-instances	Drain ECS container instances	Container migration testing
aws:ecs:stop-task	Stop ECS tasks	Task failure recovery
aws:eks:terminate-nodegroup-instances	Terminate EKS node group instances	Kubernetes node failure
aws:lambda:invocation-error	Mark Lambda invocations as failed	Lambda error handling

SSM-Based Custom Actions
Action	Description	Use Case
aws:ssm:send-command	Run SSM document on instances	Custom fault injection
AWSFIS-Run-CPU-Stress	CPU stress via SSM	CPU exhaustion testing
AWSFIS-Run-Memory-Stress	Memory stress via SSM	Memory exhaustion testing
AWSFIS-Run-Disk-Fill	Fill disk via SSM	Disk space testing
AWSFIS-Run-Network-Latency	Add network latency via SSM	Latency injection
AWSFIS-Run-Network-Packet-Loss	Packet loss via SSM	Network reliability testing

SSM-Based Extensibility

SSM-based actions are extremely flexible. AWS provides pre-built SSM documents for common chaos scenarios, and you can create custom documents for application-specific fault injection. This makes FIS extensible for scenarios AWS hasn't explicitly built actions for.
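As a rough sketch of how an SSM-based action is expressed in a template, the helper below builds the `aws:ssm:send-command` action entry for a given SSM document. The function name and target name are hypothetical. The sketch assumes the action takes `documentArn`, `documentParameters` (a JSON-encoded string), and `duration`, and that `AWSFIS-Run-CPU-Stress` accepts a `DurationSeconds` parameter; check both against the current FIS action reference before use.

```python
import json


def ssm_action(document_name, region, document_params, duration, target_name):
    """Build a template 'actions' entry that runs an AWS-owned SSM document."""
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            # AWS-owned documents have no account ID in their ARN.
            "documentArn": f"arn:aws:ssm:{region}::document/{document_name}",
            # Document parameters are passed as one JSON string, not a nested object.
            "documentParameters": json.dumps(document_params),
            "duration": duration,
        },
        "targets": {"Instances": target_name},
    }


cpu_stress = ssm_action(
    document_name="AWSFIS-Run-CPU-Stress",
    region="us-east-1",
    document_params={"DurationSeconds": "120"},
    duration="PT3M",
    target_name="WebServers",  # hypothetical target defined elsewhere in the template
)
print(json.dumps(cpu_stress, indent=2))
```

The same helper works for a custom document: swap in its name (and an account-qualified ARN if the document is owned by your account).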

Designing Experiment Templates

Experiment templates define the complete specification for a chaos experiment. Well-designed templates are reusable, safe, and provide clear insight into system behavior.

multi_az_failover_experiment.json
{
  "description": "Test application resilience to AZ failure by disrupting connectivity to us-east-1a",
  "tags": {
    "Environment": "production",
    "Team": "platform",
    "ChaosType": "az-failure"
  },
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  
  "actions": {
    "DisruptAZConnectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "description": "Disrupt traffic to AZ us-east-1a",
      "parameters": {
        "scope": "availability-zone",
        "duration": "PT5M"
      },
      "targets": {
        "Subnets": "AZSubnets"
      }
    },
    
    "MonitorRecovery": {
      "actionId": "aws:fis:wait",
      "description": "Wait for recovery observation",
      "parameters": {
        "duration": "PT2M"
      },
      "startAfter": ["DisruptAZConnectivity"]
    }
  },
  
  "targets": {
    "AZSubnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    },
    {
      "source": "aws:cloudwatch:alarm", 
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:P99LatencyHigh"
    }
  ],
  
  "logConfiguration": {
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/fis/experiments"
    }
  }
}
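The template above chains two actions with `startAfter` and ISO 8601 durations. A small Python sketch can show how those two fields determine the experiment's timeline. The function names are illustrative, the duration parser handles only the `PT…H…M…S` forms used here, and the result ignores the start-up and clean-up time a real experiment adds.

```python
import re

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(value: str) -> int:
    """Convert a duration such as 'PT5M' or 'PT1H30M' to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def action_timeline(actions: dict) -> dict:
    """Map each action name to (start, end) seconds from experiment start.

    An action starts when every action in its startAfter list has ended.
    No cycle detection: the sketch assumes a valid dependency graph.
    """
    timeline = {}

    def resolve(name):
        if name not in timeline:
            action = actions[name]
            start = max(
                (resolve(dep)[1] for dep in action.get("startAfter", [])),
                default=0,
            )
            length = duration_seconds(
                action.get("parameters", {}).get("duration", "PT0S")
            )
            timeline[name] = (start, start + length)
        return timeline[name]

    for name in actions:
        resolve(name)
    return timeline


actions = {
    "DisruptAZConnectivity": {"parameters": {"duration": "PT5M"}},
    "MonitorRecovery": {
        "parameters": {"duration": "PT2M"},
        "startAfter": ["DisruptAZConnectivity"],
    },
}
print(action_timeline(actions))
# {'DisruptAZConnectivity': (0, 300), 'MonitorRecovery': (300, 420)}
```

So the disruption runs for the first five minutes and the observation window covers the following two, for roughly seven minutes in total.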

Target selection strategies:

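FIS supports several ways to select targets: explicit ARNs, tags, and attribute filters, each combined with a selection mode (ALL, COUNT(n), or PERCENT(n)). The four target definitions below show one of each.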

target_selection_examples.json
Example 1: Explicit ARN targeting
{
  "targets": {
    "SpecificInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0",
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0987654321fedcba0"
      ],
      "selectionMode": "ALL"
    }
  }
}

Example 2: Tag-based selection with percentage (affects 25% of matching instances)
{
  "targets": {
    "WebServers": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Application": "web-frontend",
        "Environment": "production"
      },
      "selectionMode": "PERCENT(25)"
    }
  }
}

Example 3: Count-based selection (affects exactly 3 instances)
{
  "targets": {
    "SampleInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Role": "worker"
      },
      "selectionMode": "COUNT(3)"
    }
  }
}

Example 4: Filter-based selection
{
  "targets": {
    "ProductionSubnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "State",
          "values": ["available"]
        },
        {
          "path": "VpcId",
          "values": ["vpc-12345678"]
        }
      ],
      "selectionMode": "ALL"
    }
  }
}

Production Targeting Safety

When targeting production resources, prefer PERCENT or COUNT selection modes over ALL unless the experiment is meant to hit every matching resource (as in an AZ-wide test). This limits blast radius and prevents accidentally affecting every matching resource. Start with low percentages (10-25%) and increase as confidence builds.
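To get a feel for blast radius before running anything, the following sketch estimates how many resources a selection mode would pick from a given number of matches. It is not FIS code: `selected_count` is a hypothetical helper, and two behaviors are assumptions made for the sketch, namely that fractional PERCENT results round down and that COUNT is capped at the number of matches. Confirm both against the FIS documentation.

```python
import re

_MODE = re.compile(r"^(ALL|COUNT\((\d+)\)|PERCENT\((\d+)\))$")


def selected_count(selection_mode: str, matching: int) -> int:
    """Estimate how many of `matching` resources a selection mode affects."""
    match = _MODE.match(selection_mode)
    if not match:
        raise ValueError(f"unrecognized selection mode: {selection_mode!r}")
    if selection_mode == "ALL":
        return matching
    if match.group(2) is not None:  # COUNT(n)
        # Assumption: never more than the number of matching resources.
        return min(int(match.group(2)), matching)
    percent = int(match.group(3))  # PERCENT(n)
    if not 1 <= percent <= 100:
        raise ValueError("percentage must be between 1 and 100")
    # Assumption: fractional results round down.
    return matching * percent // 100


for mode in ("ALL", "COUNT(3)", "PERCENT(25)", "PERCENT(10)"):
    print(mode, "->", selected_count(mode, 10))
# ALL -> 10
# COUNT(3) -> 3
# PERCENT(25) -> 2
# PERCENT(10) -> 1
```

Under the round-down assumption, a small fleet combined with a low percentage can select nothing at all (10% of 4 instances is 0), so it is worth checking the arithmetic for your actual fleet size.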

Stop Conditions and Safety Controls

FIS stop conditions integrate with CloudWatch Alarms to automatically halt experiments when predefined thresholds are breached. This is the primary safety mechanism for preventing chaos from causing excessive damage.

Designing effective stop conditions:

stop_condition_alarms.yaml
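# CloudWatch Alarms for FIS Stop Conditions
# The values referenced with !Ref below are declared as parameters so the
# template is self-contained; supply the identifiers of your own resources.

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  ProductionALB:
    Type: String
    Description: ALB full name for the LoadBalancer dimension (app/<name>/<id>)
  ProductionTargetGroup:
    Type: String
    Description: Target group full name for the TargetGroup dimension (targetgroup/<name>/<id>)
  ProductionDatabase:
    Type: String
    Description: RDS DB instance identifier
  ProcessingQueue:
    Type: String
    Description: SQS queue name
  CriticalHostThreshold:
    Type: Number
    Description: Unhealthy host count that should halt the experiment
Resources: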
  # Alarm: High error rate
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-HighErrorRate
      AlarmDescription: "Stop FIS experiment if 5XX responses exceed 50 per minute"
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref ProductionALB
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 50  # 50 5XX errors per minute
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      
  # Alarm: High latency
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-HighLatency
      AlarmDescription: "Stop FIS experiment if P99 latency exceeds 5 seconds"
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref ProductionALB
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5.0  # 5 seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      
  # Alarm: Unhealthy targets
  UnhealthyTargetsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-UnhealthyTargets
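      # UnHealthyHostCount is a count, not a percentage: set CriticalHostThreshold
      # to half the fleet size to approximate "more than 50% unhealthy".
      AlarmDescription: "Stop FIS if the unhealthy host count exceeds CriticalHostThreshold"
      MetricName: UnHealthyHostCount
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: TargetGroup
          Value: !Ref ProductionTargetGroup
        - Name: LoadBalancer
          Value: !Ref ProductionALB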
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref CriticalHostThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      
  # Alarm: Database connections exhausted
  DatabaseConnectionsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-DBConnections
      AlarmDescription: "Stop FIS if database connections approaching limit"
      MetricName: DatabaseConnections
      Namespace: AWS/RDS
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref ProductionDatabase
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: 180  # 90% of max 200 connections
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      
  # Alarm: SQS queue depth (synthetic indicator)
  QueueDepthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-QueueBacklog
      AlarmDescription: "Stop FIS if processing queue backs up"
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Dimensions:
        - Name: QueueName
          Value: !Ref ProcessingQueue
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10000  # Queue depth threshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

Safety Best Practices

•Multiple stop conditions: Use several alarms covering different failure indicators. A single alarm is a single point of failure in your safety mechanism.
•Low evaluation periods: Set EvaluationPeriods to 1 where possible. The faster an experiment stops when issues occur, the less damage results.
•Pre-experiment alarm validation: Verify alarms are in the OK state before starting. An alarm that is already in ALARM halts the experiment as soon as it starts, so nothing gets tested.
•Test stop conditions: Periodically verify that stop conditions actually halt experiments. Don't assume; validate.
•IAM least privilege: The experiment role should have only the permissions needed for the specific experiment, not broad infrastructure access.
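The alarm-validation practice can be automated with a small pre-flight check. The sketch below works on a dictionary shaped like the response of CloudWatch's `describe_alarms` call (`MetricAlarms` entries with `AlarmName` and `StateValue`), so it runs without AWS access; the function names are illustrative, and treating `INSUFFICIENT_DATA` as blocking is a deliberately cautious choice made for this example.

```python
def states_from_describe_alarms(response: dict) -> dict:
    """Extract {alarm name: state} from a describe_alarms-style response."""
    return {
        alarm["AlarmName"]: alarm["StateValue"]
        for alarm in response.get("MetricAlarms", [])
    }


def alarms_blocking_start(alarm_states: dict) -> list:
    """Return the sorted names of alarms that are not in the OK state."""
    return sorted(name for name, state in alarm_states.items() if state != "OK")


# Sample data in the shape returned by, for example:
#   boto3.client("cloudwatch").describe_alarms(AlarmNames=[...])
sample_response = {
    "MetricAlarms": [
        {"AlarmName": "FIS-StopCondition-HighErrorRate", "StateValue": "OK"},
        {"AlarmName": "FIS-StopCondition-HighLatency", "StateValue": "INSUFFICIENT_DATA"},
    ]
}

blocking = alarms_blocking_start(states_from_describe_alarms(sample_response))
if blocking:
    print("Do not start the experiment; alarms not OK:", blocking)
else:
    print("All stop-condition alarms are OK")
```

Running this as the first step of a pipeline or runbook turns the checklist item into a hard gate rather than something an operator has to remember.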

fis_iam_role.yaml
# Example: Least-privilege IAM role for FIS experiments
 
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: FISExperimentRole-WebTier
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
              ArnLike:
                aws:SourceArn: !Sub arn:aws:fis:${AWS::Region}:${AWS::AccountId}:experiment/*
      
      Policies:
        # Only allow EC2 actions on specifically tagged instances
        - PolicyName: EC2ChaosPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:TerminateInstances
                  - ec2:StopInstances
                  - ec2:StartInstances
                Resource: "*"
                Condition:
                  StringEquals:
                    "ec2:ResourceTag/ChaosEnabled": "true"
                    "ec2:ResourceTag/Environment": "production"
        
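        # Allow SSM commands: AWS-owned FIS documents, run only on tagged instances
        - PolicyName: SSMChaosPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # The tag condition cannot apply to the document resource (AWS-owned
              # documents do not carry this tag), so documents and instances are
              # granted in separate statements.
              - Effect: Allow
                Action: ssm:SendCommand
                Resource: !Sub arn:aws:ssm:${AWS::Region}::document/AWSFIS-Run-*
              - Effect: Allow
                Action: ssm:SendCommand
                Resource: !Sub arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/*
                Condition:
                  StringEquals:
                    "ssm:resourceTag/ChaosEnabled": "true"
              # List operations do not support resource-level restrictions
              - Effect: Allow
                Action:
                  - ssm:ListCommands
                  - ssm:ListCommandInvocations
                Resource: "*"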
        
        # Deny critical resources even if tagged
        - PolicyName: ProtectedResourcesDeny
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Deny
                Action:
                  - ec2:TerminateInstances
                  - ec2:StopInstances
                Resource: "*"
                Condition:
                  StringEquals:
                    "ec2:ResourceTag/Protected": "true"
 
  # Tag policy: standardizes the allowed values of the ChaosEnabled tag.
  # Organizations policies are typically deployed from the management account.
  ProtectionTaggingPolicy:
    Type: AWS::Organizations::Policy
    Properties:
      Name: RequireChaosEnabledTag
      Description: "Restrict ChaosEnabled tag values to true or false"
      Type: TAG_POLICY
      Content:
        tags:
          ChaosEnabled:
            tag_key:
              "@@assign": "ChaosEnabled"
            tag_value:
              "@@assign":
                - "true"
                - "false"

CI/CD Integration

Integrating FIS into deployment pipelines enables automated resilience verification as part of the software delivery process.

github_actions_fis.yaml
# GitHub Actions workflow with FIS chaos gate
name: Deploy with Chaos Verification
 
on:
  push:
    branches: [main]
 
env:
  AWS_REGION: us-east-1
  
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsDeployRole
          aws-region: ${{ env.AWS_REGION }}
          
      - name: Deploy to staging
        run: |
          aws ecs update-service \
            --cluster staging-cluster \
            --service api-service \
            --force-new-deployment
          
          # Wait for deployment to stabilize
          aws ecs wait services-stable \
            --cluster staging-cluster \
            --services api-service
            
      - name: Run FIS chaos experiment
        id: chaos
        run: |
          # Start experiment
          EXPERIMENT_ID=$(aws fis start-experiment \
            --experiment-template-id EXT123abc456 \
            --query 'experiment.id' \
            --output text)
          
          echo "experiment_id=$EXPERIMENT_ID" >> $GITHUB_OUTPUT
          
          # Poll until the experiment reaches a terminal state.
          # "completed" means every action finished; a triggered stop
          # condition ends the experiment as "stopped".
          while true; do
            STATUS=$(aws fis get-experiment \
              --id "$EXPERIMENT_ID" \
              --query 'experiment.state.status' \
              --output text)

            echo "Experiment status: $STATUS"

            case "$STATUS" in
              completed)
                echo "Chaos experiment passed!"
                exit 0
                ;;
              failed|stopped|cancelled)
                REASON=$(aws fis get-experiment \
                  --id "$EXPERIMENT_ID" \
                  --query 'experiment.state.reason' \
                  --output text)
                echo "Chaos experiment $STATUS: $REASON"
                exit 1
                ;;
            esac

            sleep 30
          done
          
      - name: Promote to production
        if: success()
        run: |
          aws ecs update-service \
            --cluster production-cluster \
            --service api-service \
            --force-new-deployment
            
      - name: Cleanup on chaos failure
        if: failure() && steps.chaos.outputs.experiment_id
        run: |
          aws fis stop-experiment \
            --id ${{ steps.chaos.outputs.experiment_id }} || true
          
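          # Rollback staging deployment.
          # "api-service:previous" is a placeholder: substitute the family:revision
          # of the last known-good task definition.
          aws ecs update-service \
            --cluster staging-cluster \
            --service api-service \
            --task-definition api-service:previous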
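The polling loop in the workflow has no upper bound on how long it waits. As a sketch of the same gate with a timeout, here is a Python version that takes the status lookup as a function argument, so it can be exercised without AWS. The names `wait_for_experiment` and `TERMINAL_FAILURE` are illustrative; in real use, `get_status` could wrap boto3's `fis.get_experiment(id=...)["experiment"]["state"]["status"]`. The `cancelled` status is included defensively.

```python
import time

TERMINAL_FAILURE = {"failed", "stopped", "cancelled"}


def wait_for_experiment(get_status, timeout_seconds=600, poll_seconds=30,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the experiment ends.

    Returns True if it completed, False if it failed or was stopped,
    and raises TimeoutError if no terminal state is seen in time.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "completed":
            return True
        if status in TERMINAL_FAILURE:
            return False
        if clock() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Demonstration with canned statuses instead of a real FIS call.
canned = iter(["initiating", "running", "running", "completed"])
passed = wait_for_experiment(lambda: next(canned), sleep=lambda seconds: None)
print("gate passed" if passed else "gate failed")
# gate passed
```

Injecting `sleep` and `clock` keeps the function testable in milliseconds, which matters if the gate itself is covered by unit tests in the pipeline repository.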

AWS CodePipeline integration:

codepipeline_fis.yaml
# CodePipeline with FIS chaos gate stage
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DeploymentPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      Name: ChaosGatedDeployment
      RoleArn: !GetAtt PipelineRole.Arn
      
      Stages:
        - Name: Source
          Actions:
            - Name: SourceAction
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeStarSourceConnection
                Version: "1"
              Configuration:
                ConnectionArn: !Ref CodeStarConnection
                FullRepositoryId: org/repo
                BranchName: main
              OutputArtifacts:
                - Name: SourceOutput
                
        - Name: DeployStaging
          Actions:
            - Name: DeployToStaging
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: ECS
                Version: "1"
              Configuration:
                ClusterName: staging-cluster
                ServiceName: api-service
              InputArtifacts:
                - Name: SourceOutput
                
        - Name: ChaosVerification
          Actions:
            - Name: RunChaosExperiment
              ActionTypeId:
                Category: Invoke
                Owner: AWS
                Provider: Lambda
                Version: "1"
              Configuration:
                FunctionName: !Ref ChaosOrchestrationLambda
                UserParameters: |
                  {
                    "experimentTemplateId": "EXT123abc456",
                    "maxWaitSeconds": 600
                  }
                  
        - Name: ApprovalGate
          Actions:
            - Name: ManualApproval
              ActionTypeId:
                Category: Approval
                Owner: AWS
                Provider: Manual
                Version: "1"
              Configuration:
                CustomData: "Chaos experiment passed. Approve production deployment?"
                
        - Name: DeployProduction
          Actions:
            - Name: DeployToProduction
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: ECS
                Version: "1"
              Configuration:
                ClusterName: production-cluster
                ServiceName: api-service
              InputArtifacts:
                - Name: SourceOutput
 
  # Lambda function to orchestrate FIS experiments
  ChaosOrchestrationLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ChaosOrchestration
      Runtime: python3.11
      Handler: index.handler
      Timeout: 900
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import json
          import time

          import boto3

          fis = boto3.client('fis')
          codepipeline = boto3.client('codepipeline')

          def run_experiment(params):
              """Start the experiment and block until it finishes or times out."""
              response = fis.start_experiment(
                  experimentTemplateId=params['experimentTemplateId']
              )
              experiment_id = response['experiment']['id']

              # Wait for completion
              max_wait = params.get('maxWaitSeconds', 600)
              start = time.time()

              while time.time() - start < max_wait:
                  state = fis.get_experiment(id=experiment_id)['experiment']['state']
                  status = state['status']

                  # 'completed' means every action finished; a triggered stop
                  # condition ends the experiment as 'stopped'.
                  if status == 'completed':
                      return
                  if status in ('failed', 'stopped', 'cancelled'):
                      raise RuntimeError(f"Experiment {status}: {state.get('reason', '')}")

                  time.sleep(30)

              # Timeout - stop experiment and fail
              fis.stop_experiment(id=experiment_id)
              raise RuntimeError('Experiment timed out')

          def handler(event, context):
              job = event['CodePipeline.job']
              params = json.loads(
                  job['data']['actionConfiguration']['configuration']['UserParameters']
              )
              # A Lambda invoked by CodePipeline must report the outcome itself;
              # otherwise the pipeline action waits until it times out.
              try:
                  run_experiment(params)
              except Exception as exc:
                  codepipeline.put_job_failure_result(
                      jobId=job['id'],
                      failureDetails={'type': 'JobFailed', 'message': str(exc)},
                  )
                  return
              codepipeline.put_job_success_result(jobId=job['id'])

Operational Practices

Effective FIS usage requires operational practices that ensure chaos provides value without causing incidents.

Pre-Experiment Checklist

•Verify stop condition alarms are OK: An alarm already in ALARM state halts the experiment as soon as it starts
•Confirm no active deployments: Chaos during rollouts creates confusing signals
•Check current incident status: Never run chaos during active incidents
•Notify affected teams: Even scheduled chaos should be announced
•Verify rollback procedures: Ensure you can recover if chaos reveals issues
•Confirm observability dashboards are accessible: You can't learn from chaos you can't observe

Experiment scheduling strategy:

FIS Scheduling Recommendations
Experiment Type	Frequency	Timing	Team Involvement
Instance termination (single)	Daily	Business hours on-call	Automated, no involvement
AZ failure simulation	Weekly	Scheduled with notice	Team observes
Multi-component scenarios	Bi-weekly	GameDay format	Full team participation
Region failover	Monthly/Quarterly	Planned event	Cross-team coordination

eventbridge_scheduling.yaml
# EventBridge rule for scheduled FIS experiments
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DailyChaosSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: DailyChaosTesting
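      Description: "Run daily chaos experiment at 10 AM PST on weekdays"
      # cron is evaluated in UTC: 18:00 UTC is 10 AM PST, but 11 AM during daylight time (PDT)
      ScheduleExpression: "cron(0 18 ? * MON-FRI *)"
      State: ENABLED
      # Before relying on this, check that an FIS experiment template is accepted
      # as a rule target in your Region; the FIS user guide describes scheduling
      # experiments with EventBridge Scheduler.
      Targets:
        - Id: StartFISExperiment
          Arn: !Sub arn:aws:fis:${AWS::Region}:${AWS::AccountId}:experiment-template/EXT123abc456
          RoleArn: !GetAtt EventBridgeFISRole.Arn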
          
  # Conditional chaos based on deployment status          
  PostDeploymentChaos:
    Type: AWS::Events::Rule
    Properties:
      Name: PostDeploymentChaos
      Description: "Run chaos after successful deployments"
      EventPattern:
        source:
          - aws.codepipeline
        detail-type:
          - "CodePipeline Stage Execution State Change"
        detail:
          stage:
            - DeployStaging
          state:
            - SUCCEEDED
      State: ENABLED
      Targets:
        - Id: StartPostDeployChaos
          Arn: !GetAtt ChaosOrchestrationLambda.Arn
          
  # Blackout window - disable chaos during incidents
  ChaosBlackoutParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /chaos/blackout-enabled
      Type: String
      Value: "false"
      Description: "Set to true during incidents to prevent scheduled chaos"

Incident Integration

Implement a blackout mechanism that disables scheduled chaos during active incidents. Chaos during an incident compounds confusion and extends recovery time. Your incident management system should automatically set the blackout parameter.
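One way to honor the blackout flag is to check it immediately before starting any scheduled experiment. The sketch below is a minimal version of that check, with the parameter read passed in as a function so the logic can be tested offline; `chaos_allowed` is a hypothetical name. With boto3, the reader could be `lambda name: ssm.get_parameter(Name=name)["Parameter"]["Value"]`. The sketch fails closed: if the flag cannot be read, or holds anything other than "false", chaos does not run.

```python
def chaos_allowed(read_parameter, name="/chaos/blackout-enabled") -> bool:
    """Return True only when the blackout flag is readable and set to 'false'."""
    try:
        value = read_parameter(name)
    except Exception:
        # Fail closed: an unreadable flag is treated as an active blackout.
        return False
    return value.strip().lower() == "false"


# Offline demonstration with stand-in readers instead of a real SSM call.
print(chaos_allowed(lambda name: "false"))  # True: no blackout, chaos may run
print(chaos_allowed(lambda name: "true"))   # False: incident in progress
```

Failing closed costs an occasional skipped experiment when SSM is unreachable, which is a better trade than injecting faults when the incident state is unknown.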

Summary: FIS as AWS-Native Chaos

AWS Fault Injection Simulator brings chaos engineering natively into the AWS ecosystem, providing capabilities that only the cloud provider can offer.

Key Takeaways

•First-party integration provides authenticity: FIS injects faults through AWS's own control plane, including failure modes (such as Spot interruption notices) that external tools cannot reproduce.
•Template-based architecture enables reuse: Experiment templates can be kept in version control, shared, and evolved over time.
•Rich action catalog covers AWS services: EC2, RDS, ECS, EKS, Lambda, network, and SSM-based custom actions.
•Flexible targeting mechanisms: ARNs, tags, filters, and selection modes (ALL, COUNT, PERCENT) for precise targeting.
•CloudWatch stop conditions provide safety: Experiments halt automatically when predefined thresholds are breached.
•IAM integration enables guardrails: Experiments can only do what their IAM role permits, enabling security team oversight.
•CI/CD integration automates verification: Chaos gates in pipelines validate resilience before production deployment.

When to choose AWS FIS:

•Your infrastructure is primarily on AWS
•You need AWS-native failure modes (AZ disruption, RDS failover)
•You want IAM-based access control for chaos
•Compliance requires CloudTrail audit logging
•You prefer managed services over self-hosted tools

Module complete:

You've now explored the major chaos engineering tools available today: Chaos Monkey (the pioneer), Gremlin (enterprise platform), LitmusChaos (Kubernetes-native), Chaos Mesh (precision chaos), and AWS FIS (cloud-native). Each tool has its strengths; the right choice depends on your infrastructure, organizational needs, and chaos engineering maturity.

Module Complete

You now have comprehensive knowledge of the chaos engineering tool landscape. From Netflix's pioneering Chaos Monkey to AWS's native Fault Injection Simulator, you understand each tool's architecture, capabilities, and appropriate use cases. Apply this knowledge to select and implement chaos practices that match your organization's infrastructure and maturity level.