When you run chaos engineering on AWS, you're often simulating what AWS could do directly: terminate instances, disrupt network connectivity, exhaust resources. But third-party tools can only approximate these behaviors from the outside. What if you could inject faults using the same mechanisms AWS uses internally?
AWS Fault Injection Simulator (FIS) provides exactly this capability.
Launched in 2021, FIS is Amazon's fully managed chaos engineering service. It integrates natively with AWS services—EC2, ECS, EKS, RDS, Lambda, and more—providing fault injection capabilities that would be impossible to replicate from outside AWS's infrastructure. When FIS terminates an instance, it's the same termination path a spot interruption uses. When it disrupts AZ connectivity, it's leveraging AWS networking primitives.
By the end of this page, you will understand FIS architecture and service integration model, master experiment templates and action types across AWS services, learn to design safe experiments with stop conditions and IAM guardrails, and integrate FIS into CI/CD pipelines and operational runbooks.
Third-party chaos tools have limitations when targeting cloud infrastructure. They operate at the OS level or through APIs, which means they can't access cloud-provider internals. AWS FIS eliminates these constraints.
| Capability | Third-Party Tools | AWS FIS |
|---|---|---|
| Instance termination | API call (same as user action) | Internal termination path (like spot) |
| AZ failure simulation | Block network traffic manually | Native AZ isolation primitives |
| RDS failover | Cannot trigger internal failover | Trigger actual Multi-AZ failover |
| Lambda throttling | Cannot affect control plane | Apply service-level throttles |
| Systems Manager integration | Requires agent configuration | Native SSM integration, zero setup |
| IAM-based guardrails | Must implement separately | Built-in IAM policy controls |
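The authorization advantage:
FIS uses IAM for all authorization, which means experiment permissions are governed by the same roles and policies that control the rest of your AWS environment.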
Because FIS is an AWS service, it has access to internal APIs and mechanisms that external tools cannot reach. When testing AWS service behavior, FIS provides the most authentic failure simulation possible—it's testing AWS with AWS.
FIS follows a template-based architecture where experiments are defined as templates containing actions, targets, and stop conditions. Understanding these components is essential for designing effective experiments.
```
┌───────────────────────────────────────────────────────────────────────┐
│                         AWS FIS ARCHITECTURE                          │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                       EXPERIMENT TEMPLATE                       │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │                          ACTIONS                          │  │  │
│  │  │ ┌────────────────┐ ┌────────────────┐ ┌─────────────────┐ │  │  │
│  │  │ │ aws:ec2:       │ │ aws:rds:       │ │ aws:ssm:send-   │ │  │  │
│  │  │ │ terminate-     │ │ failover-      │ │ command         │ │  │  │
│  │  │ │ instances      │ │ db-cluster     │ │                 │ │  │  │
│  │  │ └────────────────┘ └────────────────┘ └─────────────────┘ │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │                          TARGETS                          │  │  │
│  │  │ ┌──────────────────────────┐ ┌──────────────────────────┐ │  │  │
│  │  │ │ Resource ARNs            │ │ Resource Tags            │ │  │  │
│  │  │ │ (explicit list)          │ │ (dynamic selection)      │ │  │  │
│  │  │ └──────────────────────────┘ └──────────────────────────┘ │  │  │
│  │  │ ┌──────────────────────────┐ ┌──────────────────────────┐ │  │  │
│  │  │ │ Resource Filters         │ │ Selection Mode           │ │  │  │
│  │  │ │ (path-based)             │ │ (ALL, COUNT, PERCENT)    │ │  │  │
│  │  │ └──────────────────────────┘ └──────────────────────────┘ │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  │                                                                 │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │                      STOP CONDITIONS                      │  │  │
│  │  │ ┌───────────────────────────────────────────────────────┐ │  │  │
│  │  │ │ CloudWatch Alarms (automatic halt on breach)          │ │  │  │
│  │  │ └───────────────────────────────────────────────────────┘ │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
│                                   │                                   │
│  ═══════════════════════════════════════════════════════════════════  │
│                               EXECUTION                               │
│  ═══════════════════════════════════════════════════════════════════  │
│                                   │                                   │
│  ┌────────────────────────────────▼────────────────────────────────┐  │
│  │                           FIS SERVICE                           │  │
│  │ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────────┐ │  │
│  │ │ Experiment       │ │ Action           │ │ IAM Role          │ │  │
│  │ │ Orchestrator     │ │ Executor         │ │ Assumption        │ │  │
│  │ └──────────────────┘ └──────────────────┘ └───────────────────┘ │  │
│  └────────────────────────────────┬────────────────────────────────┘  │
│                                   │                                   │
│  ┌────────────────────────────────▼────────────────────────────────┐  │
│  │                       TARGET AWS SERVICES                       │  │
│  │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────┐ │  │
│  │ │ EC2     │ │ RDS     │ │ ECS     │ │ EKS     │ │ Lambda      │ │  │
│  │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────────┘ │  │
│  │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────┐ │  │
│  │ │ SSM     │ │ EBS     │ │ VPC     │ │ S3      │ │ DynamoDB    │ │  │
│  │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────────┘ │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
```

FIS's IAM integration is a critical safety feature. Even if an experiment template specifies an action, FIS can only execute it if the experiment's IAM role has the necessary permissions. This allows security teams to control chaos scope through IAM policies rather than trusting experiment definitions alone.
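The gating effect of tag-conditioned IAM statements can be reasoned about with a small model. The sketch below is not how IAM evaluates policies internally; it only mimics the rule that an explicit deny overrides any allow. The tag names `ChaosEnabled` and `Protected` and the function name are illustrative assumptions.

```python
from typing import Mapping


def fis_may_terminate(tags: Mapping[str, str]) -> bool:
    """Simplified model of an allow-plus-deny policy pair on instance tags."""
    # Hypothetical deny statement: anything tagged Protected=true is off limits,
    # regardless of any allow that also matches.
    if tags.get("Protected") == "true":
        return False
    # Hypothetical allow statement: the instance must explicitly opt in.
    return tags.get("ChaosEnabled") == "true"


if __name__ == "__main__":
    fleet = {
        "web-1": {"ChaosEnabled": "true"},
        "db-1": {"ChaosEnabled": "true", "Protected": "true"},
        "legacy-1": {},
    }
    for name, tags in fleet.items():
        print(name, "eligible" if fis_may_terminate(tags) else "blocked")
```

The point of the model is that an experiment template naming `db-1` or `legacy-1` would still be unable to act on them, because the role, not the template, decides what is reachable.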
FIS provides an extensive catalog of actions targeting AWS services. Understanding these actions enables you to design experiments that test the specific failure modes your applications must survive.
EC2 actions:

| Action | Description | Use Case |
|---|---|---|
| aws:ec2:terminate-instances | Terminate selected instances | Instance failure resilience |
| aws:ec2:stop-instances | Stop (not terminate) instances | Planned maintenance simulation |
| aws:ec2:reboot-instances | Reboot instances | Reboot recovery testing |
| aws:ec2:send-spot-instance-interruptions | Simulate spot interruption warning | Spot instance handling |

Network actions:

| Action | Description | Use Case |
|---|---|---|
| aws:network:disrupt-connectivity | Deny selected traffic to and from target subnets (for example, cross-AZ traffic) | Multi-AZ failover testing |
| aws:network:route-table-disrupt-cross-region-connectivity | Block cross-Region traffic from target subnets | Cross-Region isolation testing |
| aws:ssm:send-command (network scripts) | Inject latency via SSM | Network performance degradation |

Database and storage actions:

| Action | Description | Use Case |
|---|---|---|
| aws:rds:failover-db-cluster | Trigger Aurora cluster failover | Database failover testing |
| aws:rds:reboot-db-instances | Reboot RDS instances | Database restart recovery |
| aws:ebs:pause-volume-io | Pause EBS volume I/O | Storage failure testing |

Container and serverless actions:

| Action | Description | Use Case |
|---|---|---|
| aws:ecs:drain-container-instances | Drain ECS container instances | Container migration testing |
| aws:ecs:stop-task | Stop ECS tasks | Task failure recovery |
| aws:eks:terminate-nodegroup-instances | Terminate EKS node group instances | Kubernetes node failure |
| aws:lambda:invocation-error | Mark Lambda invocations as failed | Lambda error handling |

SSM-based actions and documents:

| Action or SSM Document | Description | Use Case |
|---|---|---|
| aws:ssm:send-command | Run SSM document on instances | Custom fault injection |
| AWSFIS-Run-CPU-Stress | CPU stress via SSM | CPU exhaustion testing |
| AWSFIS-Run-Memory-Stress | Memory stress via SSM | Memory exhaustion testing |
| AWSFIS-Run-Disk-Fill | Fill disk via SSM | Disk space testing |
| AWSFIS-Run-Network-Latency | Add network latency via SSM | Latency injection |
| AWSFIS-Run-Network-Packet-Loss | Packet loss via SSM | Network reliability testing |
SSM-based actions are extremely flexible. AWS provides pre-built SSM documents for common chaos scenarios, and you can create custom documents for application-specific fault injection. This makes FIS extensible for scenarios AWS hasn't explicitly built actions for.
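As a rough illustration of how an SSM document plugs into an experiment, the sketch below builds the `actions` entry for `aws:ssm:send-command` running `AWSFIS-Run-CPU-Stress`. The helper name, the default Region, and the target name are assumptions made for the example; `documentParameters` is passed as a JSON-encoded string, and `DurationSeconds` is the document parameter assumed here to control how long the stress runs.

```python
import json


def ssm_cpu_stress_action(target_name: str, minutes: int,
                          region: str = "us-east-1") -> dict:
    """Build a template action that runs AWSFIS-Run-CPU-Stress via SSM."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    return {
        "actionId": "aws:ssm:send-command",
        "description": f"CPU stress for {minutes} minute(s)",
        "parameters": {
            # AWS-owned documents have no account ID in their ARN.
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
            # Document parameters travel as a JSON string, not a nested object.
            "documentParameters": json.dumps({"DurationSeconds": str(minutes * 60)}),
            # ISO 8601 duration that FIS uses to bound the action.
            "duration": f"PT{minutes}M",
        },
        # The key names the target slot of the action; the value refers to an
        # entry in the template's own "targets" section.
        "targets": {"Instances": target_name},
    }


if __name__ == "__main__":
    print(json.dumps(ssm_cpu_stress_action("WebServers", 5), indent=2))
```

A custom document would follow the same shape with a different `documentArn` and its own parameter names.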
Experiment templates define the complete specification for a chaos experiment. Well-designed templates are reusable, safe, and provide clear insight into system behavior.
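Because actions refer to targets by name and can depend on other actions through `startAfter`, a quick consistency check can catch dangling references before a template is submitted. The function below is a minimal sketch of such a lint, not part of any AWS tooling; its name and the rule that a template without stop conditions is flagged are assumptions for illustration.

```python
def find_template_problems(template: dict) -> list[str]:
    """Return human-readable problems found in an experiment template dict."""
    problems: list[str] = []
    targets = template.get("targets", {})
    actions = template.get("actions", {})

    for name, action in actions.items():
        # Every target an action uses must be defined in the targets section.
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                problems.append(f"{name}: unknown target '{target_name}'")
        # Every startAfter dependency must name another action.
        for dependency in action.get("startAfter", []):
            if dependency not in actions:
                problems.append(f"{name}: unknown startAfter action '{dependency}'")

    # Illustrative safety rule: insist on at least one stop condition.
    if not template.get("stopConditions"):
        problems.append("template defines no stop conditions")
    return problems


if __name__ == "__main__":
    draft = {
        "actions": {
            "Disrupt": {"targets": {"Subnets": "AZSubnets"}},
            "Wait": {"startAfter": ["Disrupt"]},
        },
        "targets": {"AZSubnets": {"resourceType": "aws:ec2:subnet"}},
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": "arn:..."}],
    }
    print(find_template_problems(draft) or "no problems found")
```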
```json
{
  "description": "Test application resilience to AZ failure by disrupting connectivity to us-east-1a",
  "tags": {
    "Environment": "production",
    "Team": "platform",
    "ChaosType": "az-failure"
  },
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "actions": {
    "DisruptAZConnectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "description": "Disrupt traffic to AZ us-east-1a",
      "parameters": {
        "scope": "availability-zone",
        "duration": "PT5M"
      },
      "targets": {
        "Subnets": "AZSubnets"
      }
    },
    "MonitorRecovery": {
      "actionId": "aws:fis:wait",
      "description": "Wait for recovery observation",
      "parameters": {
        "duration": "PT2M"
      },
      "startAfter": ["DisruptAZConnectivity"]
    }
  },
  "targets": {
    "AZSubnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    },
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:P99LatencyHigh"
    }
  ],
  "logConfiguration": {
    "logSchemaVersion": 2,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/fis/experiments"
    }
  }
}
```

The affected Availability Zone is determined by the subnets the target selects (here, those in us-east-1a), and a log configuration needs a `logSchemaVersion` alongside its destination.

Target selection strategies:
FIS provides flexible target selection mechanisms:
```jsonc
// Example 1: Explicit ARN targeting
{
  "targets": {
    "SpecificInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0",
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0987654321fedcba0"
      ],
      "selectionMode": "ALL"
    }
  }
}

// Example 2: Tag-based selection with percentage (affects 25% of matching instances)
{
  "targets": {
    "WebServers": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Application": "web-frontend",
        "Environment": "production"
      },
      "selectionMode": "PERCENT(25)"
    }
  }
}

// Example 3: Count-based selection (affects exactly 3 instances)
{
  "targets": {
    "SampleInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Role": "worker"
      },
      "selectionMode": "COUNT(3)"
    }
  }
}

// Example 4: Filter-based selection
{
  "targets": {
    "ProductionSubnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": {
        "Environment": "production"
      },
      "filters": [
        {
          "path": "State",
          "values": ["available"]
        },
        {
          "path": "VpcId",
          "values": ["vpc-12345678"]
        }
      ],
      "selectionMode": "ALL"
    }
  }
}
```

When targeting production resources, always use PERCENT or COUNT selection modes rather than ALL. This limits blast radius and prevents accidentally affecting every matching resource. Start with low percentages (10-25%) and increase as confidence builds.
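To estimate blast radius before running an experiment, the selection mode can be applied to the number of resources that match the tags and filters. The sketch below parses the three mode formats and returns how many resources would be affected. Two behaviors are assumptions rather than documented facts: `PERCENT(n)` is rounded down, and `COUNT(n)` is capped at the number of matches purely for estimation purposes.

```python
import re

_MODE_PATTERN = re.compile(r"^(ALL|COUNT|PERCENT)(?:\((\d+)\))?$")


def estimated_affected(selection_mode: str, matching: int) -> int:
    """Estimate how many of `matching` resources a selection mode would hit."""
    match = _MODE_PATTERN.match(selection_mode)
    if match is None:
        raise ValueError(f"unrecognized selection mode: {selection_mode!r}")
    kind, argument = match.group(1), match.group(2)

    if kind == "ALL":
        if argument is not None:
            raise ValueError("ALL takes no argument")
        return matching
    if argument is None:
        raise ValueError(f"{kind} requires an argument, e.g. {kind}(3)")

    value = int(argument)
    if kind == "COUNT":
        # Assumption: cap at the matches available, for estimation only.
        return min(value, matching)
    if not 1 <= value <= 100:
        raise ValueError("PERCENT argument must be between 1 and 100")
    # Assumption: percentages round down to a whole number of resources.
    return matching * value // 100


if __name__ == "__main__":
    for mode in ("ALL", "COUNT(3)", "PERCENT(25)"):
        print(mode, "->", estimated_affected(mode, 10), "of 10 instances")
```

Note that with rounding down, a small fleet and a low percentage can yield zero targets, which is worth checking before relying on a percentage-based experiment.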
FIS stop conditions integrate with CloudWatch Alarms to automatically halt experiments when predefined thresholds are breached. This is the primary safety mechanism for preventing chaos from causing excessive damage.
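An experiment started while one of its stop-condition alarms is already breaching tells you little, so a pre-flight check is a common complement to stop conditions. The sketch below separates a pure decision helper from the CloudWatch lookup; the function names are illustrative, and `boto3` is imported lazily so the helper can be exercised without AWS access.

```python
def alarms_blocking_start(alarm_states: dict[str, str]) -> list[str]:
    """Return the alarms that are not in the OK state, sorted by name."""
    return sorted(name for name, state in alarm_states.items() if state != "OK")


def fetch_alarm_states(alarm_names: list[str]) -> dict[str, str]:
    """Look up current alarm states in CloudWatch (requires AWS credentials)."""
    import boto3  # lazy import: only needed when actually calling AWS

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.describe_alarms(AlarmNames=alarm_names)
    return {a["AlarmName"]: a["StateValue"] for a in response["MetricAlarms"]}


if __name__ == "__main__":
    # Example with hard-coded states instead of a live lookup.
    states = {
        "FIS-StopCondition-HighErrorRate": "OK",
        "FIS-StopCondition-HighLatency": "INSUFFICIENT_DATA",
    }
    blocking = alarms_blocking_start(states)
    print("safe to start" if not blocking else f"blocked by: {blocking}")
```

Treating `INSUFFICIENT_DATA` as blocking is a deliberately conservative choice here: an alarm that cannot evaluate cannot protect the experiment either.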
Designing effective stop conditions:
```yaml
# CloudWatch Alarms for FIS Stop Conditions
# ProductionALB, ProductionTargetGroup, ProductionDatabase, ProcessingQueue, and
# CriticalHostThreshold are assumed to be defined elsewhere in the template. Each
# !Ref must resolve to the identifier the metric dimension expects (for example,
# the load balancer's full name rather than its ARN).
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  # Alarm: High error rate
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-HighErrorRate
      AlarmDescription: "Stop FIS experiment if 5XX responses exceed 50 per minute"
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref ProductionALB
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 50  # 50 5XX errors per minute
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Alarm: High latency
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-HighLatency
      AlarmDescription: "Stop FIS experiment if P99 latency exceeds 5 seconds"
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref ProductionALB
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5.0  # 5 seconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Alarm: Unhealthy targets
  UnhealthyTargetsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-UnhealthyTargets
      AlarmDescription: "Stop FIS if unhealthy host count exceeds CriticalHostThreshold"
      MetricName: UnHealthyHostCount
      Namespace: AWS/ApplicationELB
      Dimensions:
        # This metric is reported per target group and load balancer pair
        - Name: TargetGroup
          Value: !Ref ProductionTargetGroup
        - Name: LoadBalancer
          Value: !Ref ProductionALB
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref CriticalHostThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Alarm: Database connections exhausted
  DatabaseConnectionsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-DBConnections
      AlarmDescription: "Stop FIS if database connections approaching limit"
      MetricName: DatabaseConnections
      Namespace: AWS/RDS
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref ProductionDatabase
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: 180  # 90% of max 200 connections
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Alarm: SQS queue depth (synthetic indicator)
  QueueDepthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FIS-StopCondition-QueueBacklog
      AlarmDescription: "Stop FIS if processing queue backs up"
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Dimensions:
        - Name: QueueName
          Value: !Ref ProcessingQueue
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10000  # Queue depth threshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
```

IAM guardrails:

```yaml
# Example: Least-privilege IAM role for FIS experiments
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: FISExperimentRole-WebTier
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
              ArnLike:
                aws:SourceArn: !Sub arn:aws:fis:${AWS::Region}:${AWS::AccountId}:experiment/*
      Policies:
        # Only allow EC2 actions on specifically tagged instances
        - PolicyName: EC2ChaosPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:TerminateInstances
                  - ec2:StopInstances
                  - ec2:StartInstances
                Resource: "*"
                Condition:
                  StringEquals:
                    "ec2:ResourceTag/ChaosEnabled": "true"
                    "ec2:ResourceTag/Environment": "production"

        # Allow SSM commands to tagged instances
        - PolicyName: SSMChaosPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # The AWS-owned documents carry no ChaosEnabled tag, so the tag
              # condition is applied to the instance statement only
              - Effect: Allow
                Action: ssm:SendCommand
                Resource: !Sub arn:aws:ssm:${AWS::Region}::document/AWSFIS-Run-*
              - Effect: Allow
                Action: ssm:SendCommand
                Resource: !Sub arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:instance/*
                Condition:
                  StringEquals:
                    "ssm:resourceTag/ChaosEnabled": "true"
              # List operations are not scoped to individual resources
              - Effect: Allow
                Action:
                  - ssm:ListCommands
                  - ssm:ListCommandInvocations
                Resource: "*"

        # Deny critical resources even if tagged
        - PolicyName: ProtectedResourcesDeny
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Deny
                Action:
                  - ec2:TerminateInstances
                  - ec2:StopInstances
                Resource: "*"
                Condition:
                  StringEquals:
                    "ec2:ResourceTag/Protected": "true"

  # Tag policy: standardizes the ChaosEnabled key and its allowed values.
  # Organizations policies are created from the organization's management account.
  ProtectionTaggingPolicy:
    Type: AWS::Organizations::Policy
    Properties:
      Name: RequireChaosEnabledTag
      Description: "Standardize the ChaosEnabled tag used to scope FIS actions"
      Type: TAG_POLICY
      Content:
        tags:
          ChaosEnabled:
            tag_key:
              "@@assign": "ChaosEnabled"  # quoted: a bare @ cannot start a YAML key
            tag_value:
              "@@assign":
                - "true"
                - "false"
```

Integrating FIS into deployment pipelines enables automated resilience verification as part of the software delivery process.
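A pipeline gate ultimately reduces an experiment's status to pass, fail, or keep waiting. The sketch below shows one way to express that, assuming that `completed` means every action finished, that `stopped`, `failed`, and `cancelled` are terminal failures, and that anything else is still in progress. The status source is injected as a callable so the logic can be exercised without calling AWS; all names are illustrative.

```python
import time
from typing import Callable

# Assumed terminal states that should fail a pipeline gate.
TERMINAL_FAILURES = {"stopped", "failed", "cancelled"}


def classify(status: str) -> str:
    """Map an experiment status to 'pass', 'fail', or 'pending'."""
    if status == "completed":
        return "pass"
    if status in TERMINAL_FAILURES:
        return "fail"
    return "pending"


def wait_for_verdict(get_status: Callable[[], str],
                     timeout_s: float = 600.0,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the experiment passes or fails; a timeout counts as a failure."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        verdict = classify(get_status())
        if verdict != "pending":
            return verdict
        sleep(interval_s)
    return "fail"


if __name__ == "__main__":
    # Simulated status sequence in place of repeated get-experiment calls.
    simulated = iter(["initiating", "running", "running", "completed"])
    print(wait_for_verdict(lambda: next(simulated), sleep=lambda _: None))
```

In a real gate, `get_status` would wrap a call that fetches the experiment and reads its state, and a timeout would normally be followed by stopping the experiment.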
```yaml
# GitHub Actions workflow with FIS chaos gate
name: Deploy with Chaos Verification

on:
  push:
    branches: [main]

env:
  AWS_REGION: us-east-1

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsDeployRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Deploy to staging
        run: |
          aws ecs update-service \
            --cluster staging-cluster \
            --service api-service \
            --force-new-deployment

          # Wait for deployment to stabilize
          aws ecs wait services-stable \
            --cluster staging-cluster \
            --services api-service

      - name: Run FIS chaos experiment
        id: chaos
        run: |
          # Start experiment
          EXPERIMENT_ID=$(aws fis start-experiment \
            --experiment-template-id EXT123abc456 \
            --query 'experiment.id' \
            --output text)

          echo "experiment_id=$EXPERIMENT_ID" >> "$GITHUB_OUTPUT"

          # Poll until the experiment reaches a terminal state
          while true; do
            STATUS=$(aws fis get-experiment \
              --id "$EXPERIMENT_ID" \
              --query 'experiment.state.status' \
              --output text)

            echo "Experiment status: $STATUS"

            case "$STATUS" in
              completed)
                # All actions finished; a triggered stop condition would
                # leave the experiment in "stopped" instead
                echo "Chaos experiment passed!"
                exit 0
                ;;
              stopped|failed|cancelled)
                REASON=$(aws fis get-experiment \
                  --id "$EXPERIMENT_ID" \
                  --query 'experiment.state.reason' \
                  --output text)
                echo "Chaos experiment $STATUS: $REASON"
                exit 1
                ;;
            esac

            sleep 30
          done

      - name: Promote to production
        if: success()
        run: |
          aws ecs update-service \
            --cluster production-cluster \
            --service api-service \
            --force-new-deployment

      - name: Cleanup on chaos failure
        if: failure() && steps.chaos.outputs.experiment_id
        run: |
          aws fis stop-experiment \
            --id ${{ steps.chaos.outputs.experiment_id }} || true

          # Rollback staging deployment
          # "api-service:previous" is a placeholder; substitute the last
          # known-good task definition revision (for example api-service:41)
          aws ecs update-service \
            --cluster staging-cluster \
            --service api-service \
            --task-definition api-service:previous
```

AWS CodePipeline integration:
```yaml
# CodePipeline with FIS chaos gate stage
# PipelineRole, LambdaRole, CodeStarConnection, and ArtifactBucket are assumed
# to be defined elsewhere in the template.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DeploymentPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      Name: ChaosGatedDeployment
      RoleArn: !GetAtt PipelineRole.Arn
      ArtifactStore:  # required by CodePipeline; bucket name is an assumption
        Type: S3
        Location: !Ref ArtifactBucket
      Stages:
        - Name: Source
          Actions:
            - Name: SourceAction
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeStarSourceConnection
                Version: "1"
              Configuration:
                ConnectionArn: !Ref CodeStarConnection
                FullRepositoryId: org/repo
                BranchName: main
              OutputArtifacts:
                - Name: SourceOutput

        - Name: DeployStaging
          Actions:
            - Name: DeployToStaging
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: ECS
                Version: "1"
              Configuration:
                ClusterName: staging-cluster
                ServiceName: api-service
              InputArtifacts:
                - Name: SourceOutput

        - Name: ChaosVerification
          Actions:
            - Name: RunChaosExperiment
              ActionTypeId:
                Category: Invoke
                Owner: AWS
                Provider: Lambda
                Version: "1"
              Configuration:
                FunctionName: !Ref ChaosOrchestrationLambda
                UserParameters: |
                  {
                    "experimentTemplateId": "EXT123abc456",
                    "maxWaitSeconds": 600
                  }

        - Name: ApprovalGate
          Actions:
            - Name: ManualApproval
              ActionTypeId:
                Category: Approval
                Owner: AWS
                Provider: Manual
                Version: "1"
              Configuration:
                CustomData: "Chaos experiment passed. Approve production deployment?"

        - Name: DeployProduction
          Actions:
            - Name: DeployToProduction
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: ECS
                Version: "1"
              Configuration:
                ClusterName: production-cluster
                ServiceName: api-service
              InputArtifacts:
                - Name: SourceOutput

  # Lambda function to orchestrate FIS experiments
  ChaosOrchestrationLambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: ChaosOrchestration
      Runtime: python3.11
      Handler: index.handler
      Timeout: 900
      Role: !GetAtt LambdaRole.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import time

          fis = boto3.client('fis')
          codepipeline = boto3.client('codepipeline')

          def run_experiment(params):
              # Start experiment
              response = fis.start_experiment(
                  experimentTemplateId=params['experimentTemplateId']
              )
              experiment_id = response['experiment']['id']

              # Wait for completion
              max_wait = params.get('maxWaitSeconds', 600)
              start = time.time()
              while time.time() - start < max_wait:
                  state = fis.get_experiment(id=experiment_id)['experiment']['state']
                  status = state['status']
                  if status == 'completed':
                      # All actions ran; a tripped stop condition ends in 'stopped'
                      return
                  if status in ('failed', 'stopped', 'cancelled'):
                      raise Exception(f"Experiment {status}: {state.get('reason', '')}")
                  time.sleep(30)

              # Timeout - stop experiment and fail
              fis.stop_experiment(id=experiment_id)
              raise Exception('Experiment timed out')

          def handler(event, context):
              job = event['CodePipeline.job']
              params = json.loads(
                  job['data']['actionConfiguration']['configuration']['UserParameters']
              )
              # CodePipeline waits for an explicit result; returning or raising
              # alone would leave the action hanging until it times out
              try:
                  run_experiment(params)
                  codepipeline.put_job_success_result(jobId=job['id'])
              except Exception as exc:
                  codepipeline.put_job_failure_result(
                      jobId=job['id'],
                      failureDetails={'type': 'JobFailed', 'message': str(exc)}
                  )
```

Effective FIS usage requires operational practices that ensure chaos provides value without causing incidents.
Experiment scheduling strategy:
| Experiment Type | Frequency | Timing | Team Involvement |
|---|---|---|---|
| Instance termination (single) | Daily | Business hours on-call | Automated, no involvement |
| AZ failure simulation | Weekly | Scheduled with notice | Team observes |
| Multi-component scenarios | Bi-weekly | GameDay format | Full team participation |
| Region failover | Monthly/Quarterly | Planned event | Cross-team coordination |
```yaml
# EventBridge rules for scheduled FIS experiments
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DailyChaosSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: DailyChaosTesting
      Description: "Run daily chaos experiment at 10 AM PST"
      # 10 AM PST = 18:00 UTC (the schedule is fixed in UTC, so it runs at
      # 11 AM local time while daylight saving time is in effect)
      ScheduleExpression: "cron(0 18 ? * MON-FRI *)"
      State: ENABLED
      Targets:
        # Experiment templates are not assumed to be a valid rule target here,
        # so the rule invokes a Lambda function that calls fis:StartExperiment
        # for template EXT123abc456. ScheduledChaosLambda is a hypothetical
        # function defined elsewhere; it also needs an AWS::Lambda::Permission
        # that allows events.amazonaws.com to invoke it.
        - Id: StartFISExperiment
          Arn: !GetAtt ScheduledChaosLambda.Arn

  # Conditional chaos based on deployment status
  PostDeploymentChaos:
    Type: AWS::Events::Rule
    Properties:
      Name: PostDeploymentChaos
      Description: "Run chaos after successful deployments"
      EventPattern:
        source:
          - aws.codepipeline
        detail-type:
          - "CodePipeline Stage Execution State Change"
        detail:
          stage:
            - DeployStaging
          state:
            - SUCCEEDED
      State: ENABLED
      Targets:
        # The target function receives an EventBridge event here, not a
        # CodePipeline job, so its handler must accept that event shape
        - Id: StartPostDeployChaos
          Arn: !GetAtt ChaosOrchestrationLambda.Arn

  # Blackout window - disable chaos during incidents
  ChaosBlackoutParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /chaos/blackout-enabled
      Type: String
      Value: "false"
      Description: "Set to true during incidents to prevent scheduled chaos"
```

Implement a blackout mechanism that disables scheduled chaos during active incidents. Chaos during an incident compounds confusion and extends recovery time. Your incident management system should automatically set the blackout parameter.
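One way to honor the blackout parameter is to have the scheduled entry point read it before starting anything. The sketch below assumes a Lambda-style handler that receives the experiment template ID in its event under a hypothetical `experimentTemplateId` key, and it fails closed: only the literal value `false` permits chaos, so a typo or missing value blocks the run. `boto3` is imported lazily so the decision helper can be tested without AWS access.

```python
def chaos_allowed(blackout_value: str) -> bool:
    """Fail closed: chaos runs only when the parameter is exactly 'false'."""
    return blackout_value.strip().lower() == "false"


def read_blackout(parameter_name: str = "/chaos/blackout-enabled") -> str:
    """Read the blackout flag from SSM Parameter Store (requires credentials)."""
    import boto3  # lazy import: only needed when actually calling AWS

    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]


def handler(event: dict, context: object) -> dict:
    """Hypothetical scheduled entry point that respects the blackout flag."""
    if not chaos_allowed(read_blackout()):
        return {"started": False, "reason": "blackout in effect"}

    import boto3

    fis = boto3.client("fis")
    response = fis.start_experiment(
        experimentTemplateId=event["experimentTemplateId"]
    )
    return {"started": True, "experimentId": response["experiment"]["id"]}


if __name__ == "__main__":
    for value in ("false", "true", "TRUE ", ""):
        print(repr(value), "->", "run" if chaos_allowed(value) else "skip")
```

Failing closed trades an occasional skipped experiment for the guarantee that a misconfigured flag never launches chaos during an incident.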
AWS Fault Injection Simulator brings chaos engineering natively into the AWS ecosystem, providing capabilities that only the cloud provider can offer.
When to choose AWS FIS:
Module complete:
You've now explored the major chaos engineering tools available today: Chaos Monkey (the pioneer), Gremlin (enterprise platform), LitmusChaos (Kubernetes-native), Chaos Mesh (precision chaos), and AWS FIS (cloud-native). Each tool has its strengths; the right choice depends on your infrastructure, organizational needs, and chaos engineering maturity.
You now have comprehensive knowledge of the chaos engineering tool landscape. From Netflix's pioneering Chaos Monkey to AWS's native Fault Injection Simulator, you understand each tool's architecture, capabilities, and appropriate use cases. Apply this knowledge to select and implement chaos practices that match your organization's infrastructure and maturity level.