Chaos Engineering Tools

Level: Advanced

Duration: 120 mins

Topic: Chaos Tools

Gremlin: Enterprise Chaos Engineering Platform

Chaos Engineering for Everyone

Netflix had thousands of engineers and years of experience to build their chaos engineering practice from scratch. But what about organizations without those resources? What about companies that want the benefits of chaos engineering without building custom tooling, maintaining infrastructure, and developing in-house expertise?

This is the problem Gremlin set out to solve.

Founded in 2016 by former Amazon and Netflix engineers, including Kolton Andrus, who had built failure-injection tooling at both companies, Gremlin launched one of the first commercial chaos engineering platforms. Its mission: make chaos engineering accessible, safe, and valuable for any organization, regardless of size or technical sophistication.

Gremlin turned chaos engineering from a practice that only elite engineering organizations could implement into a discipline accessible to teams across the technology spectrum.

What You Will Learn

By the end of this page, you will understand Gremlin's architecture and operational model, master its attack types and targeting mechanisms, learn to design safe and effective chaos experiments, and evaluate when Gremlin is the right choice for your organization.

The Enterprise Chaos Problem

Before Gremlin, organizations attempting chaos engineering faced significant barriers. Understanding these barriers illuminates why Gremlin's approach was transformative.

Barriers to Chaos Engineering Adoption (Pre-Gremlin Era)
Barrier	Impact	What Organizations Did Instead
Custom tooling required	Months of engineering investment	Skipped chaos engineering entirely
Deep infrastructure knowledge needed	Only platform teams could participate	Limited to simple tests by experts
Safety mechanisms not obvious	Fear of causing production outages	Stuck to staging environments
No standardized attack types	Reinventing failure scenarios	Ad-hoc manual testing
Limited visibility into experiments	Difficult to correlate cause and effect	Blamed chaos for unrelated issues
No compliance/audit features	Enterprise approval blocked	Security teams vetoed chaos

The Netflix paradox:

Netflix's success with Chaos Monkey created an aspirational goal but didn't provide a roadmap for organizations that weren't Netflix. Many teams tried to replicate their approach and discovered that:

•Building reliable chaos tooling takes significant engineering effort
•Chaos tools without proper safety controls are dangerous
•Getting organizational buy-in requires demonstrating value before investment
•Maintaining chaos infrastructure competes with product development resources

Gremlin's founders recognized that chaos engineering needed to become a product, not a project—something organizations could adopt incrementally without building from scratch.

The Gremlin Insight

Chaos engineering's biggest obstacle wasn't technical—it was organizational. Teams weren't avoiding chaos because it was impossible; they were avoiding it because it was expensive, scary, and hard to justify. Gremlin attacked all three barriers simultaneously.

Gremlin Architecture

Gremlin's architecture reflects its enterprise mission: centralized control, distributed execution, comprehensive observability, and robust safety mechanisms. Understanding this architecture is essential for effective deployment and operation.

Gremlin Component Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                      GREMLIN ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │              GREMLIN CONTROL PLANE (SaaS)                       │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│  │  │  Web UI     │  │  REST API   │  │  Attack Orchestration   │ │    │
│  │  │  Dashboard  │  │  & CLI      │  │  Engine                 │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│  │  │  Team Mgmt  │  │  Scenario   │  │  Analytics &            │ │    │
│  │  │  & RBAC     │  │  Library    │  │  Reporting              │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│  └─────────────────────────────┬──────────────────────────────────┘    │
│                                │                                        │
│                    Secure TLS + mTLS Connection                        │
│                                │                                        │
│  ┌─────────────────────────────▼──────────────────────────────────┐    │
│  │                     CUSTOMER INFRASTRUCTURE                     │    │
│  │                                                                  │    │
│  │   ┌──────────────────────────────────────────────────────────┐ │    │
│  │   │                    GREMLIN DAEMON                         │ │    │
│  │   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │ │    │
│  │   │  │  Command    │  │  Attack     │  │  Telemetry      │  │ │    │
│  │   │  │  Receiver   │  │  Executor   │  │  Reporter       │  │ │    │
│  │   │  └─────────────┘  └─────────────┘  └─────────────────┘  │ │    │
│  │   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │ │    │
│  │   │  │  Safety     │  │  Health     │  │  Halt           │  │ │    │
│  │   │  │  Validator  │  │  Monitor    │  │  Handler        │  │ │    │
│  │   │  └─────────────┘  └─────────────┘  └─────────────────┘  │ │    │
│  │   └──────────────────────────────────────────────────────────┘ │    │
│  │                                                                  │    │
│  │   ┌──────────────────────────────────────────────────────────┐ │    │
│  │   │                    TARGET SYSTEMS                         │ │    │
│  │   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────┐ │ │    │
│  │   │  │   VM    │  │Container│  │ K8s Pod │  │   Lambda    │ │ │    │
│  │   │  │ Server  │  │ (Docker)│  │         │  │  Function   │ │ │    │
│  │   │  └─────────┘  └─────────┘  └─────────┘  └─────────────┘ │ │    │
│  │   └──────────────────────────────────────────────────────────┘ │    │
│  └──────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘

Key Architectural Components

•Gremlin Control Plane — The cloud-hosted SaaS that provides UI, API, attack orchestration, team management, and analytics. Customers interact with this layer; it never touches customer data or infrastructure directly.
•Gremlin Agent (Daemon) — Lightweight agent installed on target systems. Receives commands from the control plane, executes attacks locally, and reports telemetry. Runs with minimal privileges necessary for its attack types.
•Secure Communication Channel — TLS-encrypted, mutually authenticated connection between agents and control plane. Agents initiate outbound connections only—no inbound firewall rules required.
•Attack Executor — The component that actually implements failure injection. Uses OS-level primitives (iptables, tc, stress-ng) to create realistic failure conditions.
•Safety Validator — Continuously checks that attacks remain within safety bounds. Can automatically halt attacks if conditions exceed thresholds.
•Halt Handler — Emergency stop mechanism. Immediately terminates any running attack and reverses any state changes. Triggered manually, by safety validators, or by timeout.
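
The halt-and-rollback behavior described for the Halt Handler can be pictured as a stack of undo actions that is unwound in reverse order. The sketch below is only an illustration of that idea, not Gremlin's implementation; `AttackSession`, `apply`, and `halt` are hypothetical names, and the "system state" is a plain in-memory array.

```typescript
// Hypothetical sketch: each state change registers an undo action;
// halting runs the undo actions most-recent-first (LIFO).
type UndoAction = () => void;

class AttackSession {
  private undoStack: UndoAction[] = [];
  private halted = false;

  // Apply a change and remember how to reverse it.
  apply(change: () => void, undo: UndoAction): void {
    if (this.halted) throw new Error("session already halted");
    change();
    this.undoStack.push(undo);
  }

  // Reverse every recorded change. Calling halt() again is a no-op.
  halt(): void {
    this.halted = true;
    while (this.undoStack.length > 0) {
      const undo = this.undoStack.pop()!;
      undo();
    }
  }

  get isHalted(): boolean {
    return this.halted;
  }
}

// Usage: two simulated network rules are added, then rolled back on halt.
const activeRules: string[] = [];
const session = new AttackSession();
session.apply(() => activeRules.push("delay 200ms"), () => activeRules.pop());
session.apply(() => activeRules.push("drop 10%"), () => activeRules.pop());
session.halt();
console.log(activeRules.length); // 0
```

Recording the undo action at the moment a change is applied means a manual halt, a safety trigger, and a timeout can all share the same cleanup path.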

Agent deployment models:

Gremlin supports multiple deployment approaches to fit different infrastructure patterns:

Gremlin Agent Deployment Options
Environment	Deployment Method	Attack Scope
VMs (EC2, GCE, Azure VMs)	Package installation (apt, yum)	Full system: CPU, memory, disk, network
Docker Containers	Sidecar container in same network namespace	Container-scoped resource attacks
Kubernetes	DaemonSet or sidecar injection	Pod, node, or namespace-scoped attacks
AWS Lambda	Lambda Layer	Function timeout and dependency failures
Serverless (general)	Application-embedded SDK	Application-level failures

Agent Security Model

Gremlin agents are designed with minimal privilege. They only request the permissions needed for their configured attack types. An agent configured only for CPU attacks won't have network manipulation capabilities. This principle of least privilege limits blast radius if an agent is compromised.
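
To make the least-privilege idea concrete, here is a minimal sketch of deriving an agent's permissions from its enabled attack types. The capability names are real Linux capabilities, but the mapping from attack type to capability is an assumption made for illustration, and `capabilitiesFor` and `canRun` are hypothetical helpers rather than part of any Gremlin API.

```typescript
type AttackType = "cpu" | "latency" | "blackhole" | "time_travel" | "process_killer";

// Assumed mapping from attack type to the Linux capabilities it would need.
const REQUIRED_CAPS: Record<AttackType, string[]> = {
  cpu: [],
  latency: ["CAP_NET_ADMIN"],
  blackhole: ["CAP_NET_ADMIN"],
  time_travel: ["CAP_SYS_TIME"],
  process_killer: ["CAP_KILL"],
};

// Capabilities an agent should request for a given set of enabled attacks.
function capabilitiesFor(enabled: AttackType[]): Set<string> {
  const caps = new Set<string>();
  for (const attack of enabled) {
    for (const cap of REQUIRED_CAPS[attack]) caps.add(cap);
  }
  return caps;
}

// An attack may run only if every capability it needs was granted.
function canRun(attack: AttackType, granted: Set<string>): boolean {
  return REQUIRED_CAPS[attack].every((cap) => granted.has(cap));
}

const cpuOnlyAgent = capabilitiesFor(["cpu"]);
console.log(canRun("cpu", cpuOnlyAgent));     // true
console.log(canRun("latency", cpuOnlyAgent)); // false
```

An agent built this way cannot be used for network manipulation even if it receives a latency command, because the permission was never granted in the first place.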

Gremlin Attack Categories

Gremlin organizes its attacks into three primary categories, each targeting a different dimension of system behavior. Understanding these categories helps design comprehensive chaos experiments.

Resource Attacks

•CPU — Consumes CPU cycles to test behavior under computational pressure. Configurable by core count and utilization percentage. Tests: autoscaling triggers, priority scheduling, performance degradation handling.
•Memory — Allocates and holds memory to test behavior under memory pressure. Configurable by absolute size or percentage. Tests: OOM killer behavior, memory-based autoscaling, garbage collection under pressure.
•Disk — Consumes disk space or generates I/O load. Tests: disk full scenarios, I/O-intensive workload handling, log rotation under pressure.
•IO — Generates disk I/O operations independent of space consumption. Tests: I/O-bound application behavior, disk queue saturation.

Network Attacks

•Latency — Adds configurable delay to network traffic. Can target specific hosts, ports, or protocols. Tests: timeout handling, circuit breaker thresholds, user experience degradation.
•Packet Loss — Randomly drops network packets at configurable rates. Tests: TCP retransmission handling, application-level retry logic, idempotency.
•Blackhole — Completely drops traffic to specified destinations. Simulates network partition or complete service unavailability. Tests: failover behavior, circuit breakers, error handling.
•DNS — Blocks or delays DNS resolution. Tests: DNS caching, resilience to resolver failures, hard-coded IP fallbacks.

State Attacks

•Process Killer — Terminates specified processes. Can target by name, PID, or resource consumption. Tests: process restart handling, supervisor behavior, state recovery.
•Shutdown — Initiates system shutdown or reboot. Tests: boot-time behavior, instance replacement, state persistence.
•Time Travel — Shifts system clock forward or backward. Tests: time-based logic, certificate validation, scheduled job handling, token expiration.

gremlin_attack_examples.sh
# Gremlin CLI Attack Examples
 
# === RESOURCE ATTACKS ===
 
# CPU: Consume 80% of 2 cores for 60 seconds
gremlin attack cpu -l 60 -c 2 -p 80
 
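# Memory: Consume 512MB of memory for 120 seconds
# (-m is megabytes; -g would request gigabytes)
gremlin attack memory -l 120 -m 512
 
# Disk: Fill the volume containing /tmp to 90% capacity for 60 seconds
# (-p sets the fill percentage; -w and -b set worker count and block size)
gremlin attack disk -l 60 -d /tmp -p 90
 
# IO: Generate mixed read/write disk I/O with 4 workers for 90 seconds
gremlin attack io -l 90 -d /var/data -m rw -w 4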
 
 
# === NETWORK ATTACKS ===
 
# Latency: Add 200ms delay to traffic to database server
gremlin attack latency -l 120 -m 200 -h db.internal.company.com -p 5432
 
# Packet Loss: 10% packet loss to external APIs
gremlin attack packet_loss -l 90 -r 10 -h api.external-service.com
 
# Blackhole: Block all traffic to payment service for 60 seconds
gremlin attack blackhole -l 60 -h payment-service.internal
 
# DNS: Block DNS resolution for 30 seconds
gremlin attack dns -l 30
 
 
# === STATE ATTACKS ===
 
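# Process Kill: Terminate the nginx process
# (-i takes an interval in seconds, so it cannot be passed without a value)
gremlin attack process_killer -l 1 -p nginx
 
# Shutdown: Reboot the system after a 1-minute delay (-d is given in minutes)
gremlin attack shutdown -d 1 -r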
 
# Time Travel: Move clock forward 2 hours
gremlin attack time_travel -l 120 -o 7200
 
 
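# === TARGETED ATTACKS (Containers and Kubernetes) ===
# NOTE: The targeting flags below are illustrative. Verify them with
# `gremlin help` for your agent version; Kubernetes targets are commonly
# selected through the web UI or REST API rather than CLI flags.
 
# Attack a specific container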
gremlin attack cpu -l 60 -c 1 -p 50 \
    -t container \
    --container-id abc123def
 
# Attack all pods in a namespace matching a label
gremlin attack latency -l 120 -m 100 \
    -t kubernetes \
    --namespace production \
    --labels app=api-gateway
 
# Attack a percentage of pods in a deployment
gremlin attack blackhole -l 60 \
    -t kubernetes \
    --deployment api-server \
    --percent 25

Attack Combination Considerations

Be cautious when combining attacks. Running memory and CPU attacks at the same time can trigger the OOM killer unpredictably. Combining network latency with packet loss can degrade traffic more than either setting suggests, because every retransmitted packet also incurs the injected delay. Start with single-dimension attacks before exploring combinations.

Targeting and Scope Control

Gremlin's targeting system allows precise control over which systems experience chaos. This precision is essential for safe experimentation and meaningful learning.

Gremlin Targeting Dimensions
Dimension	Mechanism	Use Case
Infrastructure	Agent tags, cloud provider metadata	Target by region, availability zone, instance type
Container	Container ID, image name, labels	Target specific containers in shared hosts
Kubernetes	Namespace, deployment, pod labels, node selectors	Target within Kubernetes semantics
Service	Service mesh metadata, Consul/Eureka registration	Target by logical service, not infrastructure
Random	Percentage-based selection	Attack random subset of matching targets
Precise	Explicit target IDs	Attack exactly specified systems

Blast radius management:

Controlling blast radius—the scope of potential impact—is central to safe chaos engineering. Gremlin provides multiple mechanisms for limiting blast radius:

blast_radius_config.yaml
# Example: Gremlin-style scenario with blast radius controls
# (illustrative schema for discussion; field names are not a documented Gremlin file format)
 
name: "API Gateway Latency Test"
description: "Test API gateway behavior under database latency"
hypothesis: "API gateway circuit breakers trip at 500ms latency"
 
# Define targeting with explicit limits
targeting:
  type: kubernetes
  namespace: production
  labels:
    app: api-gateway
    tier: frontend
  
  # Blast radius controls
  limits:
    # Attack at most 2 pods
    max_targets: 2
    # OR attack at most 25% of matching pods
    max_percentage: 25
    # Exclude any target labeled as critical
    avoid_labels:
      critical: "true"
    # Never attack last healthy instance
    preserve_minimum: 1
 
attack:
  type: latency
  parameters:
    delay_ms: 500
    jitter_ms: 50
    # Only affect database traffic
    target_hosts:
      - "postgres.internal"
      - "redis.internal"
    target_ports:
      - 5432
      - 6379
  
  # Duration and safety limits
  duration_seconds: 120
  
  # Auto-halt conditions
  halt_conditions:
    - metric: "error_rate"
      threshold: 0.05  # 5% error rate
      window_seconds: 30
    - metric: "p99_latency_ms"
      threshold: 5000  # 5 second latency
      window_seconds: 60
 
# Scheduling
schedule:
  # Only during business hours
  allowed_hours: "09:00-17:00"
  timezone: "America/Los_Angeles"
  days: ["monday", "tuesday", "wednesday", "thursday", "friday"]
  
  # Not during maintenance windows
  blackout_periods:
    - name: "Weekly deployment window"
      schedule: "tuesday 14:00-16:00"
    - name: "Monthly maintenance"
      schedule: "first saturday 02:00-06:00"

Targeting Best Practices

•Start narrow, expand gradually — Begin with single-target attacks, then expand to percentages as confidence builds.
•Use logical targeting over physical — Target by service/deployment rather than specific instance IDs. This makes experiments reproducible across infrastructure changes.
•Apply avoid lists for critical paths — Tag systems that should never receive autonomous chaos (payment processing, authentication, etc.).
•Implement minimum instance preservation — Always preserve at least one healthy instance to prevent complete outages.
•Consider cascading effects — An attack on service A might affect dependent services B and C. Factor this into blast radius calculations.
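
The limits discussed in this section (absolute cap, percentage cap, avoid lists, minimum preservation) can be combined into one selection step. The sketch below assumes that when several limits are set, the most restrictive one wins; that rule, the type names, and the function names are assumptions for illustration, and it picks targets deterministically where a real system would likely sample at random.

```typescript
// Hypothetical types mirroring the limits in the example scenario.
interface BlastRadiusLimits {
  maxTargets: number;      // absolute cap
  maxPercentage: number;   // cap as a percentage of eligible targets (0-100)
  preserveMinimum: number; // targets that must always be left untouched
}

interface Target {
  id: string;
  labels: Record<string, string>;
}

// Assumption: the most restrictive limit decides how many targets are allowed.
function allowedTargetCount(eligible: number, limits: BlastRadiusLimits): number {
  const byPercentage = Math.floor((eligible * limits.maxPercentage) / 100);
  const byPreserve = eligible - limits.preserveMinimum;
  return Math.max(0, Math.min(limits.maxTargets, byPercentage, byPreserve));
}

// Drop targets carrying any avoided label, then apply the count limit.
function selectTargets(
  candidates: Target[],
  avoidLabels: Record<string, string>,
  limits: BlastRadiusLimits,
): Target[] {
  const eligible = candidates.filter(
    (t) => !Object.keys(avoidLabels).some((k) => t.labels[k] === avoidLabels[k]),
  );
  return eligible.slice(0, allowedTargetCount(eligible.length, limits));
}

// Usage: ten pods, one of them labeled critical.
const limits: BlastRadiusLimits = { maxTargets: 2, maxPercentage: 25, preserveMinimum: 1 };
const pods: Target[] = [];
for (let i = 0; i < 10; i++) {
  const labels: Record<string, string> = {};
  if (i === 0) labels["critical"] = "true";
  pods.push({ id: `api-gateway-${i}`, labels });
}
const chosen = selectTargets(pods, { critical: "true" }, limits);
console.log(chosen.map((t) => t.id)); // [ 'api-gateway-1', 'api-gateway-2' ]
```

Note that with only three eligible pods, a 25% cap rounds down to zero targets, so small deployments need either a higher percentage or an explicit single-target experiment.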

Scenarios and Reliability Tests

Gremlin introduces Scenarios and Reliability Tests—structured approaches to chaos that go beyond individual attacks to comprehensive resilience validation.

Scenarios: Multi-step chaos experiments

Scenarios chain multiple attacks together to simulate complex real-world failure cascades. A single attack tests one failure mode; a scenario tests how failures compound.

cascade_scenario.yaml
# Example: Database Failure Cascade Scenario
# Simulates progressive database degradation leading to failure
 
name: "Progressive Database Failure"
description: "Simulates database performance degradation before complete failure"
team: "platform-engineering"
tags: ["database", "cascade", "monthly-test"]
 
steps:
  - name: "Phase 1: Increased Latency"
    description: "Database response time increases"
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 100
        duration_seconds: 180
    success_criteria:
      - description: "Application error rate stays below 1%"
        metric: "app_error_rate"
        condition: "< 0.01"
      - description: "Circuit breakers remain closed"
        metric: "circuit_breaker_state"
        condition: "== closed"
    
  - name: "Phase 2: Severe Latency"
    description: "Database becomes very slow"
    wait_after_previous: 30  # seconds
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 500
        duration_seconds: 120
    success_criteria:
      - description: "Circuit breakers open"
        metric: "circuit_breaker_state"
        condition: "== open"
      - description: "Fallback cache is serving requests"
        metric: "cache_fallback_rate"
        condition: "> 0.9"
      - description: "Error rate stays acceptable"
        metric: "app_error_rate"
        condition: "< 0.05"
    
  - name: "Phase 3: Complete Failure"
    description: "Database becomes unreachable"
    wait_after_previous: 30
    attack:
      type: blackhole
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        duration_seconds: 120
    success_criteria:
      - description: "Application remains functional via fallback"
        metric: "health_check_status"
        condition: "== healthy"
      - description: "User-facing error rate acceptable"
        metric: "user_visible_errors"
        condition: "< 0.10"
      - description: "Alerting triggered within 30 seconds"
        metric: "time_to_alert"
        condition: "< 30"
 
  - name: "Phase 4: Recovery"
    description: "Database becomes available again"
    wait_after_previous: 0  # Previous attack ends, this documents recovery
    attack:
      type: none  # No attack - observing recovery
      parameters:
        duration_seconds: 180
    success_criteria:
      - description: "Circuit breakers close within 60 seconds"
        metric: "time_to_circuit_close"
        condition: "< 60"
      - description: "Database connections re-established"
        metric: "db_connection_count"
        condition: "> 0"
      - description: "Error rate returns to baseline"
        metric: "app_error_rate"
        condition: "< 0.001"
 
overall_success_criteria:
  - "No data loss during any phase"
  - "Maximum user-visible error rate across all phases < 10%"
  - "Complete recovery within 5 minutes of database restoration"
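
The `success_criteria` entries above pair a metric name with a condition string such as `< 0.01` or `== open`. As a rough sketch of how such criteria might be checked against observed metrics, the code below parses the condition and compares it with a value. The parsing rules, the `evaluate` and `failedCriteria` names, and the choice to treat a missing metric as a failure are all assumptions, not documented Gremlin behavior.

```typescript
type MetricValue = number | string;

interface Criterion {
  description: string;
  metric: string;
  condition: string; // e.g. "< 0.05" or "== open"
}

// Evaluate one condition string against an observed value.
function evaluate(condition: string, actual: MetricValue): boolean {
  const match = /^\s*(<=|>=|==|<|>)\s*(.+?)\s*$/.exec(condition);
  if (!match) throw new Error(`unsupported condition: ${condition}`);
  const op = match[1];
  const rawExpected = match[2];
  // Equality is compared as text so it works for states like "open".
  if (op === "==") return String(actual) === rawExpected;
  const expected = Number(rawExpected);
  if (typeof actual !== "number" || isNaN(expected)) {
    throw new Error(`numeric comparison needs numbers: ${condition}`);
  }
  switch (op) {
    case "<": return actual < expected;
    case "<=": return actual <= expected;
    case ">": return actual > expected;
    case ">=": return actual >= expected;
    default: throw new Error(`unsupported operator: ${op}`);
  }
}

// Return the descriptions of criteria that were not met.
// Assumption: a metric with no observed value counts as a failure.
function failedCriteria(
  criteria: Criterion[],
  metrics: Record<string, MetricValue>,
): string[] {
  return criteria
    .filter((c) => !(c.metric in metrics) || !evaluate(c.condition, metrics[c.metric]))
    .map((c) => c.description);
}

// Usage: the Phase 2 criteria, checked against made-up observations.
const phase2: Criterion[] = [
  { description: "Circuit breakers open", metric: "circuit_breaker_state", condition: "== open" },
  { description: "Fallback cache is serving requests", metric: "cache_fallback_rate", condition: "> 0.9" },
  { description: "Error rate stays acceptable", metric: "app_error_rate", condition: "< 0.05" },
];
const observed = { circuit_breaker_state: "open", cache_fallback_rate: 0.95, app_error_rate: 0.07 };
console.log(failedCriteria(phase2, observed)); // [ 'Error rate stays acceptable' ]
```

Reporting which criteria failed, rather than a single pass/fail flag, makes it easier to tell a broken fallback from a threshold that was simply set too tightly.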

Reliability Tests: Pre-built chaos experiments

Gremlin's Reliability Tests provide ready-made experiments for common failure scenarios. These tests encode industry best practices and make it easy to validate common resilience patterns.

Gremlin Pre-built Reliability Tests
Test Category	What It Validates	Common Findings
Host Reliability	Can services survive instance failures?	Missing auto-scaling, inadequate health checks
Service Discovery	Can services find dependencies after failure?	Hardcoded endpoints, stale cache entries
Graceful Shutdown	Do services drain connections properly?	Immediate termination, connection errors during deploy
Redundancy Validation	Is redundancy actually providing protection?	Single-AZ deployments, shared dependencies
Dependency Isolation	Does dependency failure cascade?	Missing circuit breakers, timeout issues
Data Persistence	Is data safe during infrastructure failures?	Missing replication, improper fsync usage

Start with Reliability Tests

For teams new to chaos engineering, Gremlin's pre-built Reliability Tests provide a safe starting point. They validate fundamental resilience properties without requiring deep expertise in failure mode design. Graduate to custom Scenarios once you've validated basics.

Safety Mechanisms

Gremlin's safety mechanisms differentiate professional chaos engineering from simply breaking things. These mechanisms ensure that chaos remains controlled, observable, and reversible.

Multi-Layer Safety Architecture

•Halt Button — One-click emergency stop that immediately terminates all running attacks across all targets. Available in UI, CLI, and API. Non-negotiable feature for enterprise adoption.
•Attack Duration Limits — All attacks have mandatory maximum duration. When time expires, attack automatically stops and system state is restored. Prevents runaway chaos.
•Automatic Rollback — Gremlin maintains state before attacks and automatically reverses changes on completion or halt. Network rules are removed, processes are cleaned up.
•Status Checks — Continuous health monitoring during attacks. If target systems degrade beyond thresholds, attacks can auto-halt. Configurable per attack or globally.
•Scheduling Constraints — Attacks can be time-bounded to business hours, exclude maintenance windows, or require manual approval for certain target types.
•RBAC Controls — Role-based access control limits who can perform chaos and against which targets. Junior engineers might be limited to staging; only leads can touch production.

safety_configuration.ts
// Example: Gremlin Safety Configuration via API
 
interface SafetyConfiguration {
  // Global policies
  global: {
    // Maximum attack duration (prevents indefinite chaos)
    maxDurationSeconds: number;
    // Require approval for production attacks
    requireProductionApproval: boolean;
    // Auto-halt if these conditions are met
    globalHaltConditions: HaltCondition[];
  };
  
  // Team-level overrides
  teams: {
    [teamId: string]: TeamSafetyConfig;
  };
  
  // Target-level protections
  targetProtections: TargetProtection[];
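}
 
// Per-team limits referenced by SafetyConfiguration.teams
interface TeamSafetyConfig {
  maxDurationSeconds: number;
  // Attacks touching more than this percentage of targets need approval
  requireApprovalAbovePercentage: number;
  allowedEnvironments: string[];
  restrictedToOwnServices?: boolean;
}
 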
interface HaltCondition {
  metric: string;
  threshold: number;
  comparison: 'gt' | 'lt' | 'eq' | 'gte' | 'lte';
  windowSeconds: number;
}
 
interface TargetProtection {
  // What to protect
  selector: {
    tags?: Record<string, string>;
    labels?: Record<string, string>;
    services?: string[];
    namespaces?: string[];
  };
  
  // Protection level
  protection: 'blocked' | 'approval-required' | 'limited';
  
  // If limited, what's allowed
  allowedAttacks?: string[];
  maxDurationSeconds?: number;
  allowedHours?: string; // cron format
}
 
// Example configuration
const safetyConfig: SafetyConfiguration = {
  global: {
    maxDurationSeconds: 3600, // 1 hour max
    requireProductionApproval: true,
    globalHaltConditions: [
      {
        metric: "critical_alerts_firing",
        threshold: 1,
        comparison: "gte",
        windowSeconds: 60
      },
      {
        metric: "deployment_in_progress",
        threshold: 1,
        comparison: "eq",
        windowSeconds: 0
      }
    ]
  },
  
  teams: {
    "platform-engineering": {
      // Platform team has broader access
      maxDurationSeconds: 7200,
      requireApprovalAbovePercentage: 50,
      allowedEnvironments: ["staging", "production"]
    },
    "product-team-a": {
      // Product teams more restricted
      maxDurationSeconds: 1800,
      requireApprovalAbovePercentage: 25,
      allowedEnvironments: ["staging", "production"],
      restrictedToOwnServices: true
    }
  },
  
  targetProtections: [
    {
      // Never chaos the payment system without approval
      selector: {
        services: ["payment-service", "billing-service"],
        labels: { "tier": "critical" }
      },
      protection: "approval-required"
    },
    {
      // Authentication can only receive limited attacks
      selector: {
        services: ["auth-service", "identity-provider"]
      },
      protection: "limited",
      allowedAttacks: ["latency", "cpu"],
      maxDurationSeconds: 300,
      allowedHours: "* 9-13 * * 1-5" // Monday-Friday, 09:00-13:59
    },
    {
      // Database chaos completely blocked via self-serve
      selector: {
        labels: { "component": "database" }
      },
      protection: "blocked"
    }
  ]
};
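
One possible reading of a `HaltCondition` like those configured above: the condition trips when every metric sample inside the trailing window satisfies the comparison, and a window of 0 means only the latest sample is checked. That interpretation, the `Sample` type, and the `shouldHalt` function are assumptions for illustration; the types are redeclared so the sketch is self-contained.

```typescript
type Comparison = "gt" | "lt" | "eq" | "gte" | "lte";

interface HaltCondition {
  metric: string;
  threshold: number;
  comparison: Comparison;
  windowSeconds: number;
}

// Hypothetical metric sample; samples are assumed to be sorted by time.
interface Sample {
  timestampSeconds: number;
  value: number;
}

function compare(value: number, comparison: Comparison, threshold: number): boolean {
  switch (comparison) {
    case "gt": return value > threshold;
    case "lt": return value < threshold;
    case "eq": return value === threshold;
    case "gte": return value >= threshold;
    case "lte": return value <= threshold;
  }
}

// Assumption: halt only if the comparison holds for the whole window,
// which avoids halting on a single noisy data point.
function shouldHalt(condition: HaltCondition, samples: Sample[], nowSeconds: number): boolean {
  if (samples.length === 0) return false;
  const windowStart = nowSeconds - condition.windowSeconds;
  const inWindow =
    condition.windowSeconds === 0
      ? [samples[samples.length - 1]]
      : samples.filter((s) => s.timestampSeconds >= windowStart && s.timestampSeconds <= nowSeconds);
  return (
    inWindow.length > 0 &&
    inWindow.every((s) => compare(s.value, condition.comparison, condition.threshold))
  );
}

// Usage: error rate climbing over three samples, evaluated at t = 100s.
const errorRate: Sample[] = [
  { timestampSeconds: 70, value: 0.02 },
  { timestampSeconds: 80, value: 0.06 },
  { timestampSeconds: 90, value: 0.08 },
];
const base = { metric: "error_rate", threshold: 0.05, comparison: "gt" as Comparison };
console.log(shouldHalt({ ...base, windowSeconds: 30 }, errorRate, 100)); // false
console.log(shouldHalt({ ...base, windowSeconds: 20 }, errorRate, 100)); // true
```

With the 30-second window the earliest sample (0.02) is still below the threshold, so the attack continues; with the 20-second window every sample exceeds 5% and the attack would be halted.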

Safety Is Non-Negotiable

Chaos engineering that isn't safe isn't chaos engineering—it's just breaking things. Every mature chaos practice invests heavily in safety mechanisms. If you find yourself thinking 'we can skip this safety check,' you're not ready for that experiment.

Enterprise Features

Gremlin's enterprise features address the organizational and compliance requirements that often block chaos adoption in larger organizations.

Gremlin Enterprise Capabilities
Feature	Capability	Enterprise Need Addressed
SSO Integration	SAML 2.0, OIDC, Active Directory	Central identity management requirement
RBAC	Fine-grained permission controls	Least-privilege access, audit compliance
Audit Logging	Immutable logs of all chaos activity	SOC 2, regulatory compliance
Team Management	Isolated team environments	Multi-tenant organizations
Private Agents	On-premise control plane option	Data sovereignty, airgapped environments
SLA Reporting	Chaos correlation with SLA metrics	Executive reporting, ROI justification
Custom Integrations	Webhooks, API, CI/CD plugins	Workflow integration requirements

CI/CD Integration:

Gremlin integrates with deployment pipelines to provide automated resilience validation as part of the software delivery process.

.github/workflows/chaos-gate.yaml
# Example: Gremlin as a deployment gate in GitHub Actions
 
name: Deploy with Chaos Gate
on:
  push:
    branches: [main]
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to staging
        run: |
          kubectl apply -f k8s/staging/
          kubectl rollout status deployment/api-server -n staging
      
      - name: Wait for deployment stabilization
        run: sleep 60
      
      - name: Run Gremlin Reliability Test
        # NOTE: illustrative action reference and input names; substitute the
        # action or API call that your Gremlin setup actually provides.
        uses: gremlin/actions/reliability-test@v1
        with:
          api-key: ${{ secrets.GREMLIN_API_KEY }}
          team-id: ${{ secrets.GREMLIN_TEAM_ID }}
          test-name: "post-deploy-validation"
          # Run pre-defined reliability tests
          # (action inputs must be scalar strings, so the list is comma-separated)
          tests: "redundancy-validation,graceful-shutdown,dependency-isolation"
          # Target the newly deployed service
          target-type: kubernetes
          target-namespace: staging
          target-deployment: api-server
          # Halt if error rate spikes
          halt-on-error-rate: "0.05"
          
      - name: Promote to production
        if: success()
        run: |
          echo "Chaos gate passed - promoting to production"
          kubectl apply -f k8s/production/
          
      - name: Alert on chaos failure
        if: failure()
        run: |
          echo "Chaos gate failed - blocking production deployment"
          # Send alert via Slack/PagerDuty/etc.

Shift-Left Chaos

By integrating Gremlin into CI/CD pipelines, organizations shift chaos testing left—catching resilience issues before production rather than discovering them during incidents. This reduces incident frequency and increases deployment confidence.

Summary: Gremlin's Role in the Chaos Ecosystem

Gremlin democratized chaos engineering by transforming it from a practice requiring significant custom development into an accessible platform with enterprise-grade features.

Key Takeaways

•Gremlin solved the adoption problem — By providing a complete platform, it removed the need for organizations to build chaos tooling from scratch.
•The architecture separates control from execution — SaaS control plane provides management while agents execute locally, balancing convenience with security.
•Three attack categories cover most failure modes — Resource, network, and state attacks simulate the failures that matter most in distributed systems.
•Targeting and scope controls enable safe experimentation — Precise targeting, percentage limits, and avoid lists keep chaos bounded and useful.
•Scenarios enable complex failure simulation — Multi-step attacks simulate real-world failure cascades that single attacks cannot replicate.
•Safety mechanisms are central, not optional — Halt buttons, duration limits, automatic rollback, and RBAC make chaos safe for enterprise environments.
•Enterprise features unlock organizational adoption — SSO, audit logging, compliance reporting, and CI/CD integration address the non-technical barriers to chaos.

When to choose Gremlin:

•You need chaos capabilities quickly, without building tooling
•Compliance and audit requirements demand commercial solutions
•You want a managed platform with professional support
•Your organization spans multiple cloud providers and infrastructure types
•You need RBAC and team management for multi-tenant chaos

What's next:

Gremlin represents the commercial, enterprise approach to chaos. But what about cloud-native, open-source alternatives? In the next page, we'll explore LitmusChaos, a Kubernetes-native chaos engineering framework that brings chaos to the container orchestration layer.

Page Complete

You now understand Gremlin's architecture, attack types, safety mechanisms, and enterprise features. Gremlin transformed chaos engineering from an elite practice into an accessible discipline, making resilience testing achievable for organizations of all sizes and maturities.

2 / 5

Loading learning content...

System Design (HLD)Chaos Tools

Chaos Engineering Tools

LevelAdvanced

Duration120 mins

TopicChaos Tools

2 / 5

Gremlin: Enterprise Chaos Engineering Platform

Chaos Engineering for Everyone

This is the problem Gremlin set out to solve.

Gremlin transformed chaos engineering from a practice that only engineering elite could implement into a discipline accessible to teams across the technology spectrum.

What You Will Learn

The Enterprise Chaos Problem

Before Gremlin, organizations attempting chaos engineering faced significant barriers. Understanding these barriers illuminates why Gremlin's approach was transformative.

Barriers to Chaos Engineering Adoption (Pre-Gremlin Era)
Barrier	Impact	What Organizations Did Instead
Custom tooling required	Months of engineering investment	Skipped chaos engineering entirely
Deep infrastructure knowledge needed	Only platform teams could participate	Limited to simple tests by experts
Safety mechanisms not obvious	Fear of causing production outages	Stuck to staging environments
No standardized attack types	Reinventing failure scenarios	Ad-hoc manual testing
Limited visibility into experiments	Difficult to correlate cause and effect	Blamed chaos for unrelated issues
No compliance/audit features	Enterprise approval blocked	Security teams vetoed chaos

The Netflix paradox:

Netflix's success with Chaos Monkey set an aspirational goal but offered no roadmap for organizations that weren't Netflix. Many teams that tried to replicate its approach discovered that:

Building reliable chaos tooling takes significant engineering effort
Chaos tools without proper safety controls are dangerous
Getting organizational buy-in requires demonstrating value before investment
Maintaining chaos infrastructure competes with product development resources

Gremlin's founders recognized that chaos engineering needed to become a product, not a project—something organizations could adopt incrementally without building from scratch.

The Gremlin Insight

Gremlin Architecture

Gremlin Component Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                      GREMLIN ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │              GREMLIN CONTROL PLANE (SaaS)                       │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│  │  │  Web UI     │  │  REST API   │  │  Attack Orchestration   │ │    │
│  │  │  Dashboard  │  │  & CLI      │  │  Engine                 │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│  │  │  Team Mgmt  │  │  Scenario   │  │  Analytics &            │ │    │
│  │  │  & RBAC     │  │  Library    │  │  Reporting              │ │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│  └─────────────────────────────┬──────────────────────────────────┘    │
│                                │                                        │
│                    Secure TLS + mTLS Connection                        │
│                                │                                        │
│  ┌─────────────────────────────▼──────────────────────────────────┐    │
│  │                     CUSTOMER INFRASTRUCTURE                     │    │
│  │                                                                  │    │
│  │   ┌──────────────────────────────────────────────────────────┐ │    │
│  │   │                    GREMLIN DAEMON                         │ │    │
│  │   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │ │    │
│  │   │  │  Command    │  │  Attack     │  │  Telemetry      │  │ │    │
│  │   │  │  Receiver   │  │  Executor   │  │  Reporter       │  │ │    │
│  │   │  └─────────────┘  └─────────────┘  └─────────────────┘  │ │    │
│  │   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │ │    │
│  │   │  │  Safety     │  │  Health     │  │  Halt           │  │ │    │
│  │   │  │  Validator  │  │  Monitor    │  │  Handler        │  │ │    │
│  │   │  └─────────────┘  └─────────────┘  └─────────────────┘  │ │    │
│  │   └──────────────────────────────────────────────────────────┘ │    │
│  │                                                                  │    │
│  │   ┌──────────────────────────────────────────────────────────┐ │    │
│  │   │                    TARGET SYSTEMS                         │ │    │
│  │   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────┐ │ │    │
│  │   │  │   VM    │  │Container│  │ K8s Pod │  │   Lambda    │ │ │    │
│  │   │  │ Server  │  │ (Docker)│  │         │  │  Function   │ │ │    │
│  │   │  └─────────┘  └─────────┘  └─────────┘  └─────────────┘ │ │    │
│  │   └──────────────────────────────────────────────────────────┘ │    │
│  └──────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘

Key Architectural Components

•Gremlin Control Plane — The cloud-hosted SaaS that provides UI, API, attack orchestration, team management, and analytics. Customers interact with this layer; it never touches customer data or infrastructure directly.
•Gremlin Agent (Daemon) — Lightweight agent installed on target systems. Receives commands from the control plane, executes attacks locally, and reports telemetry. Runs with minimal privileges necessary for its attack types.
•Secure Communication Channel — TLS-encrypted, mutually authenticated connection between agents and control plane. Agents initiate outbound connections only—no inbound firewall rules required.
•Attack Executor — The component that actually implements failure injection. Uses OS-level mechanisms, such as Linux traffic control and packet filtering for network attacks, to create realistic failure conditions.
•Safety Validator — Continuously checks that attacks remain within safety bounds. Can automatically halt attacks if conditions exceed thresholds.
•Halt Handler — Emergency stop mechanism. Immediately terminates any running attack and reverses any state changes. Triggered manually, by safety validators, or by timeout.
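
The Halt Handler's promise to reverse state changes can be pictured as an undo stack: each change an attack makes is paired with the action that reverses it, and a halt unwinds them newest first. The TypeScript below is a minimal sketch under that assumption. `AttackSession` and the in-memory `rules` array are hypothetical stand-ins for illustration, not Gremlin's agent code.

```typescript
// Hypothetical model of an attack session; not Gremlin's agent implementation.
type Undo = () => void;

class AttackSession {
  private undoStack: Undo[] = [];
  private halted = false;

  // Apply a state change and remember how to reverse it.
  apply(change: () => void, undo: Undo): void {
    if (this.halted) {
      throw new Error("session already halted");
    }
    change();
    this.undoStack.push(undo);
  }

  // Reverse every recorded change, newest first. Safe to call repeatedly.
  halt(): void {
    this.halted = true;
    while (this.undoStack.length > 0) {
      const undo = this.undoStack.pop()!;
      try {
        undo();
      } catch {
        // Keep unwinding even if one cleanup step fails.
      }
    }
  }

  get isHalted(): boolean {
    return this.halted;
  }
}

// Usage: a fake rule table stands in for real system state.
const rules: string[] = [];
const session = new AttackSession();
session.apply(
  () => rules.push("drop dst=payment-service"),
  () => rules.splice(rules.indexOf("drop dst=payment-service"), 1),
);
session.apply(
  () => rules.push("delay 200ms dst=db"),
  () => rules.splice(rules.indexOf("delay 200ms dst=db"), 1),
);
const rulesDuringAttack = rules.length; // both rules are in place
session.halt();
const rulesAfterHalt = rules.length; // rollback removed them
```

Unwinding in reverse order matters when later changes depend on earlier ones, and refusing new changes after a halt keeps a stopped attack from quietly resuming.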

Agent deployment models:

Gremlin supports multiple deployment approaches to fit different infrastructure patterns:

Gremlin Agent Deployment Options
Environment	Deployment Method	Attack Scope
VMs (EC2, GCE, Azure VMs)	Package installation (apt, yum, homebrew)	Full system: CPU, memory, disk, network
Docker Containers	Sidecar container in same network namespace	Container-scoped resource attacks
Kubernetes	DaemonSet or sidecar injection	Pod, node, or namespace-scoped attacks
AWS Lambda	Lambda Layer	Function timeout and dependency failures
Serverless (general)	Application-embedded SDK	Application-level failures

Agent Security Model

Gremlin Attack Categories

Gremlin organizes its attacks into three primary categories, each targeting a different dimension of system behavior. Understanding these categories helps design comprehensive chaos experiments.

Resource Attacks

•CPU — Consumes CPU cycles to test behavior under computational pressure. Configurable by core count and utilization percentage. Tests: autoscaling triggers, priority scheduling, performance degradation handling.
•Memory — Allocates and holds memory to test behavior under memory pressure. Configurable by absolute size or percentage. Tests: OOM killer behavior, memory-based autoscaling, garbage collection under pressure.
•Disk — Fills disk space on a target volume. Tests: disk-full scenarios, log rotation under pressure.
•IO — Generates disk read/write load without consuming space. Tests: I/O-bound application behavior, I/O-intensive workload handling, disk queue saturation.

Network Attacks

•Latency — Adds configurable delay to network traffic. Can target specific hosts, ports, or protocols. Tests: timeout handling, circuit breaker thresholds, user experience degradation.
•Packet Loss — Randomly drops network packets at configurable rates. Tests: TCP retransmission handling, application-level retry logic, idempotency.
•Blackhole — Completely drops traffic to specified destinations. Simulates network partition or complete service unavailability. Tests: failover behavior, circuit breakers, error handling.
•DNS — Blocks or delays DNS resolution. Tests: DNS caching, resilience to resolver failures, hard-coded IP fallbacks.
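
Latency and blackhole attacks are most useful when there is a concrete resilience mechanism to observe. As an illustration, here is a minimal circuit breaker that counts slow calls as failures, with the clock passed in so the behavior can be checked deterministically. The class, thresholds, and timings are assumptions chosen for the example and are not part of Gremlin.

```typescript
// Minimal circuit breaker; names and thresholds are illustrative assumptions.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAtMs = 0;
  private current: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly slowCallMs: number;
  private readonly resetAfterMs: number;

  constructor(failureThreshold: number, slowCallMs: number, resetAfterMs: number) {
    this.failureThreshold = failureThreshold;
    this.slowCallMs = slowCallMs;
    this.resetAfterMs = resetAfterMs;
  }

  // Current state; an open breaker becomes half-open once the cool-down elapses.
  state(nowMs: number): BreakerState {
    if (this.current === "open" && nowMs - this.openedAtMs >= this.resetAfterMs) {
      this.current = "half-open";
    }
    return this.current;
  }

  // Whether a call may be attempted right now.
  allowRequest(nowMs: number): boolean {
    return this.state(nowMs) !== "open";
  }

  // Record the outcome of a call; calls slower than slowCallMs count as failures.
  record(durationMs: number, ok: boolean, nowMs: number): void {
    const failed = !ok || durationMs > this.slowCallMs;
    if (!failed) {
      this.failures = 0;
      this.current = "closed";
      return;
    }
    this.failures += 1;
    if (this.current === "half-open" || this.failures >= this.failureThreshold) {
      this.current = "open";
      this.openedAtMs = nowMs;
    }
  }
}

// Simulate a latency attack that makes every database call take 500 ms.
const breaker = new CircuitBreaker(3, 400, 10000);
for (let t = 0; t < 3; t++) {
  breaker.record(500, true, t * 1000);
}
const stateUnderAttack = breaker.state(3000); // "open": three slow calls in a row
const stateAfterCooldown = breaker.state(12000); // "half-open": 10 s after opening
breaker.record(20, true, 12100); // attack ended, a fast probe call succeeds
const stateAfterRecovery = breaker.state(12200); // "closed"
```

An experiment built around this would state its hypothesis in these terms: the breaker opens once injected latency exceeds the slow-call threshold and closes again after the attack ends.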

State Attacks

•Process Killer — Terminates specified processes. Can target by name, PID, or resource consumption. Tests: process restart handling, supervisor behavior, state recovery.
•Shutdown — Initiates system shutdown or reboot. Tests: boot-time behavior, instance replacement, state persistence.
•Time Travel — Shifts system clock forward or backward. Tests: time-based logic, certificate validation, scheduled job handling, token expiration.
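
A Time Travel attack is easiest to reason about when application code reads time through an injectable clock. The sketch below assumes a hypothetical `Token` shape and a `shiftedClock` helper that mimics a clock offset. It shows how a two-hour forward jump expires a one-hour token, and how a small backward shift can also invalidate a token unless some skew is tolerated.

```typescript
// Hypothetical token shape and clock abstraction, for illustration only.
interface Token {
  issuedAtMs: number;
  ttlMs: number;
}
type Clock = () => number;

// Returns a clock that reports the base clock's time shifted by offsetSeconds.
function shiftedClock(base: Clock, offsetSeconds: number): Clock {
  return () => base() + offsetSeconds * 1000;
}

// Valid only between issuance and expiry; allowedSkewMs widens both ends.
function isTokenValid(token: Token, clock: Clock, allowedSkewMs: number = 0): boolean {
  const now = clock();
  const notBefore = token.issuedAtMs - allowedSkewMs;
  const expiresAt = token.issuedAtMs + token.ttlMs + allowedSkewMs;
  return now >= notBefore && now <= expiresAt;
}

const fixedNow: Clock = () => 1700000000000;
const token: Token = { issuedAtMs: fixedNow(), ttlMs: 60 * 60 * 1000 }; // 1 hour TTL

const validNow = isTokenValid(token, fixedNow);
// Clock moved forward 2 hours: the 1-hour token has expired.
const validAfterJump = isTokenValid(token, shiftedClock(fixedNow, 7200));
// Clock moved back 5 minutes: the token appears to be issued in the future.
const validAfterRewind = isTokenValid(token, shiftedClock(fixedNow, -300));
// The same rewind passes when 10 minutes of skew is tolerated.
const validWithSkew = isTokenValid(token, shiftedClock(fixedNow, -300), 10 * 60 * 1000);
```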

gremlin_attack_examples.sh
# Gremlin CLI Attack Examples
 
# === RESOURCE ATTACKS ===
 
# CPU: Consume 80% of 2 cores for 60 seconds
gremlin attack cpu -l 60 -c 2 -p 80
 
# Memory: Consume 512 MB of memory for 120 seconds (-m is megabytes; -g is gigabytes)
gremlin attack memory -l 120 -m 512
 
# Disk: Fill the volume holding /tmp to 90% for 60 seconds
gremlin attack disk -l 60 -d /tmp -p 90
 
# IO: Generate read/write disk I/O for 90 seconds (4 workers, 4 KB blocks)
gremlin attack io -l 90 -d /var/data -m rw -w 4 -s 4 -c 1
 
 
# === NETWORK ATTACKS ===
 
# Latency: Add 200ms delay to traffic to database server
gremlin attack latency -l 120 -m 200 -h db.internal.company.com -p 5432
 
# Packet Loss: 10% packet loss to external APIs
gremlin attack packet_loss -l 90 -r 10 -h api.external-service.com
 
# Blackhole: Block all traffic to payment service for 60 seconds
gremlin attack blackhole -l 60 -h payment-service.internal
 
# DNS: Block DNS resolution for 30 seconds
gremlin attack dns -l 30
 
 
# === STATE ATTACKS ===
 
# Process Kill: Terminate nginx every 5 seconds for 60 seconds
gremlin attack process_killer -l 60 -i 5 -p nginx
 
# Shutdown: Reboot the system after a 1-minute delay (-d is in minutes)
gremlin attack shutdown -d 1 -r
 
# Time Travel: Move clock forward 2 hours
gremlin attack time_travel -l 120 -o 7200
 
 
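# === TARGETED ATTACKS (Containers and Kubernetes) ===
# The targeting flags in this section are illustrative; exact syntax varies by
# CLI version, and Kubernetes targets are often selected through the UI or API.
 
# Attack a specific container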
gremlin attack cpu -l 60 -c 1 -p 50 \
    -t container \
    --container-id abc123def
 
# Attack all pods in a namespace matching a label
gremlin attack latency -l 120 -m 100 \
    -t kubernetes \
    --namespace production \
    --labels app=api-gateway
 
# Attack a percentage of pods in a deployment
gremlin attack blackhole -l 60 \
    -t kubernetes \
    --deployment api-server \
    --percent 25

Attack Combination Considerations

Targeting and Scope Control

Gremlin's targeting system allows precise control over which systems experience chaos. This precision is essential for safe experimentation and meaningful learning.

Gremlin Targeting Dimensions
Dimension	Mechanism	Use Case
Infrastructure	Agent tags, cloud provider metadata	Target by region, availability zone, instance type
Container	Container ID, image name, labels	Target specific containers in shared hosts
Kubernetes	Namespace, deployment, pod labels, node selectors	Target within Kubernetes semantics
Service	Service mesh metadata, Consul/Eureka registration	Target by logical service, not infrastructure
Random	Percentage-based selection	Attack random subset of matching targets
Precise	Explicit target IDs	Attack exactly specified systems

Blast radius management:

Controlling blast radius—the scope of potential impact—is central to safe chaos engineering. Gremlin provides multiple mechanisms for limiting blast radius:

blast_radius_config.yaml
# Example: Gremlin Scenario with Blast Radius Controls
# Illustrative schema; consult Gremlin's API documentation for the actual scenario format.
 
name: "API Gateway Latency Test"
description: "Test API gateway behavior under database latency"
hypothesis: "API gateway circuit breakers trip at 500ms latency"
 
# Define targeting with explicit limits
targeting:
  type: kubernetes
  namespace: production
  labels:
    app: api-gateway
    tier: frontend
  
  # Blast radius controls
  limits:
    # Attack at most 2 pods
    max_targets: 2
    # OR attack at most 25% of matching pods
    max_percentage: 25
    # Skip any target labeled as critical
    avoid_labels:
      critical: "true"
    # Never attack last healthy instance
    preserve_minimum: 1
 
attack:
  type: latency
  parameters:
    delay_ms: 500
    jitter_ms: 50
    # Only affect database traffic
    target_hosts:
      - "postgres.internal"
      - "redis.internal"
    target_ports:
      - 5432
      - 6379
  
  # Duration and safety limits
  duration_seconds: 120
  
  # Auto-halt conditions
  halt_conditions:
    - metric: "error_rate"
      threshold: 0.05  # 5% error rate
      window_seconds: 30
    - metric: "p99_latency_ms"
      threshold: 5000  # 5 second latency
      window_seconds: 60
 
# Scheduling
schedule:
  # Only during business hours
  allowed_hours: "09:00-17:00"
  timezone: "America/Los_Angeles"
  days: ["monday", "tuesday", "wednesday", "thursday", "friday"]
  
  # Not during maintenance windows
  blackout_periods:
    - name: "Weekly deployment window"
      schedule: "tuesday 14:00-16:00"
    - name: "Monthly maintenance"
      schedule: "first saturday 02:00-06:00"

Targeting Best Practices

•Start narrow, expand gradually — Begin with single-target attacks, then expand to percentages as confidence builds.
•Use logical targeting over physical — Target by service/deployment rather than specific instance IDs. This makes experiments reproducible across infrastructure changes.
•Apply avoid lists for critical paths — Tag systems that should never receive autonomous chaos (payment processing, authentication, etc.).
•Implement minimum instance preservation — Always preserve at least one healthy instance to prevent complete outages.
•Consider cascading effects — An attack on service A might affect dependent services B and C. Factor this into blast radius calculations.
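
The practices above can be combined into a single selection step. The following TypeScript is a sketch under stated assumptions: `Target`, `BlastRadiusLimits`, and `selectTargets` are hypothetical names, and the function applies the strictest of the count, percentage, and preservation limits rather than treating them as alternatives, which is one possible design choice.

```typescript
// Hypothetical shapes mirroring the blast radius controls described above.
interface Target {
  id: string;
  labels: Record<string, string>;
  healthy: boolean;
}
interface BlastRadiusLimits {
  maxTargets: number;
  maxPercentage: number; // 0-100, applied to eligible targets
  avoidLabels: Record<string, string>;
  preserveMinimum: number; // healthy instances that must stay untouched
}

// True when the target carries any label/value pair from the avoid list.
function isAvoided(target: Target, avoidLabels: Record<string, string>): boolean {
  return Object.keys(avoidLabels).some((key) => target.labels[key] === avoidLabels[key]);
}

function selectTargets(candidates: Target[], limits: BlastRadiusLimits): Target[] {
  // 1. Drop unhealthy targets and anything on the avoid list.
  const eligible = candidates.filter((t) => t.healthy && !isAvoided(t, limits.avoidLabels));
  // 2. Apply the strictest of the count, percentage, and preservation limits.
  const healthyCount = candidates.filter((t) => t.healthy).length;
  const byPercentage = Math.floor((eligible.length * limits.maxPercentage) / 100);
  const byPreservation = Math.max(0, healthyCount - limits.preserveMinimum);
  const allowed = Math.min(limits.maxTargets, byPercentage, byPreservation);
  // 3. Take the first `allowed` targets; a real system would likely sample randomly.
  return eligible.slice(0, allowed);
}

// Eight pods: pod 0 is tagged critical, pod 7 is already unhealthy.
const pods: Target[] = [];
for (let i = 0; i < 8; i++) {
  const labels: Record<string, string> = {};
  if (i === 0) {
    labels["critical"] = "true";
  }
  pods.push({ id: `api-gateway-${i}`, labels, healthy: i !== 7 });
}

const chosen = selectTargets(pods, {
  maxTargets: 2,
  maxPercentage: 25,
  avoidLabels: { critical: "true" },
  preserveMinimum: 1,
});
// Six pods are eligible; 25% of six rounds down to one, so one pod is attacked.
```

Rounding the percentage down keeps small fleets on the conservative side: with fewer than four eligible pods, a 25% limit selects nothing at all.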

Scenarios and Reliability Tests

Gremlin introduces Scenarios and Reliability Tests—structured approaches to chaos that go beyond individual attacks to comprehensive resilience validation.

Scenarios: Multi-step chaos experiments

Scenarios chain multiple attacks together to simulate complex real-world failure cascades. A single attack tests one failure mode; a scenario tests how failures compound.

cascade_scenario.yaml
# Example: Database Failure Cascade Scenario
# Simulates progressive database degradation leading to failure
 
name: "Progressive Database Failure"
description: "Simulates database performance degradation before complete failure"
team: "platform-engineering"
tags: ["database", "cascade", "monthly-test"]
 
steps:
  - name: "Phase 1: Increased Latency"
    description: "Database response time increases"
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 100
        duration_seconds: 180
    success_criteria:
      - description: "Application error rate stays below 1%"
        metric: "app_error_rate"
        condition: "< 0.01"
      - description: "Circuit breakers remain closed"
        metric: "circuit_breaker_state"
        condition: "== closed"
    
  - name: "Phase 2: Severe Latency"
    description: "Database becomes very slow"
    wait_after_previous: 30  # seconds
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 500
        duration_seconds: 120
    success_criteria:
      - description: "Circuit breakers open"
        metric: "circuit_breaker_state"
        condition: "== open"
      - description: "Fallback cache is serving requests"
        metric: "cache_fallback_rate"
        condition: "> 0.9"
      - description: "Error rate stays acceptable"
        metric: "app_error_rate"
        condition: "< 0.05"
    
  - name: "Phase 3: Complete Failure"
    description: "Database becomes unreachable"
    wait_after_previous: 30
    attack:
      type: blackhole
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        duration_seconds: 120
    success_criteria:
      - description: "Application remains functional via fallback"
        metric: "health_check_status"
        condition: "== healthy"
      - description: "User-facing error rate acceptable"
        metric: "user_visible_errors"
        condition: "< 0.10"
      - description: "Alerting triggered within 30 seconds"
        metric: "time_to_alert"
        condition: "< 30"
 
  - name: "Phase 4: Recovery"
    description: "Database becomes available again"
    wait_after_previous: 0  # Previous attack ends, this documents recovery
    attack:
      type: none  # No attack - observing recovery
      parameters:
        duration_seconds: 180
    success_criteria:
      - description: "Circuit breakers close within 60 seconds"
        metric: "time_to_circuit_close"
        condition: "< 60"
      - description: "Database connections re-established"
        metric: "db_connection_count"
        condition: "> 0"
      - description: "Error rate returns to baseline"
        metric: "app_error_rate"
        condition: "< 0.001"
 
overall_success_criteria:
  - "No data loss during any phase"
  - "Maximum user-visible error rate across all phases < 10%"
  - "Complete recovery within 5 minutes of database restoration"
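
One way to picture how a multi-step scenario like the one above is evaluated: run each phase in order, compare observed metrics with that phase's success criteria, and stop at the first phase that fails. The runner below is a sketch, not Gremlin's engine. `Step`, `Criterion`, `runScenario`, and the canned metric values are all assumptions made for the example.

```typescript
// Hypothetical shapes; the runner and metric values are illustrative assumptions.
type Metrics = Record<string, number>;
interface Criterion {
  description: string;
  metric: string;
  op: "<" | ">" | "==";
  value: number;
}
interface Step {
  name: string;
  criteria: Criterion[];
}
interface StepResult {
  step: string;
  passed: boolean;
  failures: string[];
}

function check(criterion: Criterion, metrics: Metrics): boolean {
  const observed = metrics[criterion.metric];
  if (observed === undefined) return false; // a missing metric counts as a failure
  if (criterion.op === "<") return observed < criterion.value;
  if (criterion.op === ">") return observed > criterion.value;
  return observed === criterion.value;
}

// Runs steps in order and stops at the first step whose criteria fail,
// so a harsher phase never starts on a system that is already struggling.
function runScenario(steps: Step[], observe: (step: Step) => Metrics): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const metrics = observe(step);
    const failures = step.criteria
      .filter((c) => !check(c, metrics))
      .map((c) => c.description);
    results.push({ step: step.name, passed: failures.length === 0, failures });
    if (failures.length > 0) break;
  }
  return results;
}

const steps: Step[] = [
  { name: "Phase 1", criteria: [{ description: "error rate below 1%", metric: "app_error_rate", op: "<", value: 0.01 }] },
  { name: "Phase 2", criteria: [{ description: "fallback cache serving", metric: "cache_fallback_rate", op: ">", value: 0.9 }] },
  { name: "Phase 3", criteria: [{ description: "user-visible errors below 10%", metric: "user_visible_errors", op: "<", value: 0.1 }] },
];

// Canned observations stand in for a real metrics backend.
const canned: Record<string, Metrics> = {
  "Phase 1": { app_error_rate: 0.002 },
  "Phase 2": { cache_fallback_rate: 0.4 },
  "Phase 3": { user_visible_errors: 0.03 },
};
const results = runScenario(steps, (step) => canned[step.name] || {});
// Phase 1 passes, Phase 2 fails on the cache criterion, and Phase 3 never runs.
```

Stopping early is itself a finding: here the fallback cache serves too little traffic, which is worth fixing before testing a complete database outage.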

Reliability Tests: Pre-built chaos experiments

Gremlin's Reliability Tests provide ready-made experiments for common failure scenarios. These tests encode industry best practices and make it easy to validate common resilience patterns.

Gremlin Pre-built Reliability Tests
Test Category	What It Validates	Common Findings
Host Reliability	Can services survive instance failures?	Missing auto-scaling, inadequate health checks
Service Discovery	Can services find dependencies after failure?	Hardcoded endpoints, stale cache entries
Graceful Shutdown	Do services drain connections properly?	Immediate termination, connection errors during deploy
Redundancy Validation	Is redundancy actually providing protection?	Single-AZ deployments, shared dependencies
Dependency Isolation	Does dependency failure cascade?	Missing circuit breakers, timeout issues
Data Persistence	Is data safe during infrastructure failures?	Missing replication, improper fsync usage

Start with Reliability Tests

Safety Mechanisms

Gremlin's safety mechanisms differentiate professional chaos engineering from simply breaking things. These mechanisms ensure that chaos remains controlled, observable, and reversible.

Multi-Layer Safety Architecture

•Halt Button — One-click emergency stop that immediately terminates all running attacks across all targets. Available in UI, CLI, and API. Non-negotiable feature for enterprise adoption.
•Attack Duration Limits — Every attack has a mandatory maximum duration. When the time expires, the attack stops automatically and system state is restored. Prevents runaway chaos.
•Automatic Rollback — Gremlin records the pre-attack state and automatically reverses changes on completion or halt. Network rules are removed, processes are cleaned up.
•Status Checks — Continuous health monitoring during attacks. If target systems degrade beyond thresholds, attacks can auto-halt. Configurable per attack or globally.
•Scheduling Constraints — Attacks can be time-bounded to business hours, exclude maintenance windows, or require manual approval for certain target types.
•RBAC Controls — Role-based access control limits who can perform chaos and against which targets. Junior engineers might be limited to staging; only leads can touch production.
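
A status check of the kind described above could be sketched as a rule evaluated over a trailing window of metric samples. In the TypeScript below, `HaltRule`, `Sample`, and `shouldHalt` are hypothetical names, and halting when no samples are available is an assumed fail-safe choice rather than documented Gremlin behavior.

```typescript
// Hypothetical shapes for an auto-halt check; not Gremlin's API.
interface Sample {
  timestampMs: number;
  value: number;
}
interface HaltRule {
  metric: string;
  threshold: number;
  windowSeconds: number;
}

// Averages the samples inside the trailing window and compares the mean
// against the rule's threshold.
function shouldHalt(rule: HaltRule, samples: Sample[], nowMs: number): boolean {
  const windowStart = nowMs - rule.windowSeconds * 1000;
  const recent = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
  );
  // Fail safe: without telemetry we cannot confirm the system is healthy.
  if (recent.length === 0) return true;
  const mean = recent.reduce((sum, s) => sum + s.value, 0) / recent.length;
  return mean > rule.threshold;
}

const errorRateRule: HaltRule = { metric: "error_rate", threshold: 0.05, windowSeconds: 30 };
const samples: Sample[] = [
  { timestampMs: 0, value: 0.2 }, // old spike, outside both windows below
  { timestampMs: 40000, value: 0.01 },
  { timestampMs: 50000, value: 0.02 },
  { timestampMs: 60000, value: 0.15 },
];

const haltAt50s = shouldHalt(errorRateRule, samples, 50000); // mean 0.015, keep running
const haltAt60s = shouldHalt(errorRateRule, samples, 60000); // mean 0.06, halt
const haltWithoutData = shouldHalt(errorRateRule, [], 60000); // no data, halt
```

Averaging over a window trades reaction time for stability: a single noisy sample does not stop the experiment, but a sustained rise does.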

safety_configuration.ts
// Example: Gremlin Safety Configuration via API
 
interface SafetyConfiguration {
  // Global policies
  global: {
    // Maximum attack duration (prevents indefinite chaos)
    maxDurationSeconds: number;
    // Require approval for production attacks
    requireProductionApproval: boolean;
    // Auto-halt if these conditions are met
    globalHaltConditions: HaltCondition[];
  };
  
  // Team-level overrides
  teams: {
    [teamId: string]: TeamSafetyConfig;
  };
  
  // Target-level protections
  targetProtections: TargetProtection[];
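}
 
// Per-team limits that override the global policy
interface TeamSafetyConfig {
  maxDurationSeconds: number;
  requireApprovalAbovePercentage: number;
  allowedEnvironments: string[];
  restrictedToOwnServices?: boolean;
}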
 
interface HaltCondition {
  metric: string;
  threshold: number;
  comparison: 'gt' | 'lt' | 'eq' | 'gte' | 'lte';
  windowSeconds: number;
}
 
interface TargetProtection {
  // What to protect
  selector: {
    tags?: Record<string, string>;
    labels?: Record<string, string>;
    services?: string[];
    namespaces?: string[];
  };
  
  // Protection level
  protection: 'blocked' | 'approval-required' | 'limited';
  
  // If limited, what's allowed
  allowedAttacks?: string[];
  maxDurationSeconds?: number;
  allowedHours?: string; // cron format
}
 
// Example configuration
const safetyConfig: SafetyConfiguration = {
  global: {
    maxDurationSeconds: 3600, // 1 hour max
    requireProductionApproval: true,
    globalHaltConditions: [
      {
        metric: "critical_alerts_firing",
        threshold: 1,
        comparison: "gte",
        windowSeconds: 60
      },
      {
        metric: "deployment_in_progress",
        threshold: 1,
        comparison: "eq",
        windowSeconds: 0
      }
    ]
  },
  
  teams: {
    "platform-engineering": {
      // Platform team has broader access
      maxDurationSeconds: 7200,
      requireApprovalAbovePercentage: 50,
      allowedEnvironments: ["staging", "production"]
    },
    "product-team-a": {
      // Product teams more restricted
      maxDurationSeconds: 1800,
      requireApprovalAbovePercentage: 25,
      allowedEnvironments: ["staging", "production"],
      restrictedToOwnServices: true
    }
  },
  
  targetProtections: [
    {
      // Never chaos the payment system without approval
      selector: {
        services: ["payment-service", "billing-service"],
        labels: { "tier": "critical" }
      },
      protection: "approval-required"
    },
    {
      // Authentication can only receive limited attacks
      selector: {
        services: ["auth-service", "identity-provider"]
      },
      protection: "limited",
      allowedAttacks: ["latency", "cpu"],
      maxDurationSeconds: 300,
      allowedHours: "0 9-14 * * 1-5" // Weekdays 9AM-2PM only
    },
    {
      // Database chaos completely blocked via self-serve
      selector: {
        labels: { "component": "database" }
      },
      protection: "blocked"
    }
  ]
};
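
To show how team-level limits like those in the configuration above might be enforced, here is a small authorization check. It is a sketch under assumptions: `TeamPolicy`, `AttackRequest`, and `authorize` are hypothetical names defined locally so the example stands alone, and the order of the checks is one reasonable choice among several.

```typescript
// Hypothetical policy and request shapes, defined locally for the example.
interface TeamPolicy {
  maxDurationSeconds: number;
  requireApprovalAbovePercentage: number;
  allowedEnvironments: string[];
}
interface AttackRequest {
  environment: string;
  durationSeconds: number;
  targetPercentage: number;
}
type Decision =
  | { outcome: "allowed" }
  | { outcome: "approval-required"; reason: string }
  | { outcome: "blocked"; reason: string };

// Hard limits are checked first; approval is only considered for requests
// that are otherwise within policy.
function authorize(request: AttackRequest, policy: TeamPolicy): Decision {
  if (policy.allowedEnvironments.indexOf(request.environment) === -1) {
    return { outcome: "blocked", reason: `environment ${request.environment} not permitted` };
  }
  if (request.durationSeconds > policy.maxDurationSeconds) {
    return { outcome: "blocked", reason: "duration exceeds team maximum" };
  }
  if (request.targetPercentage > policy.requireApprovalAbovePercentage) {
    return { outcome: "approval-required", reason: "blast radius above self-serve limit" };
  }
  return { outcome: "allowed" };
}

// Values borrowed from the "product-team-a" entry in the configuration above.
const productTeamPolicy: TeamPolicy = {
  maxDurationSeconds: 1800,
  requireApprovalAbovePercentage: 25,
  allowedEnvironments: ["staging", "production"],
};

const smallAttack = authorize(
  { environment: "production", durationSeconds: 600, targetPercentage: 10 },
  productTeamPolicy,
); // allowed
const wideAttack = authorize(
  { environment: "production", durationSeconds: 600, targetPercentage: 40 },
  productTeamPolicy,
); // approval-required
const longAttack = authorize(
  { environment: "production", durationSeconds: 3600, targetPercentage: 10 },
  productTeamPolicy,
); // blocked
```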

Safety Is Non-Negotiable

Enterprise Features

Gremlin's enterprise features address the organizational and compliance requirements that often block chaos adoption in larger organizations.

Gremlin Enterprise Capabilities
Feature	Capability	Enterprise Need Addressed
SSO Integration	SAML 2.0, OIDC, Active Directory	Central identity management requirement
RBAC	Fine-grained permission controls	Least-privilege access, audit compliance
Audit Logging	Immutable logs of all chaos activity	SOC 2, regulatory compliance
Team Management	Isolated team environments	Multi-tenant organizations
Private Agents	On-premise control plane option	Data sovereignty, airgapped environments
SLA Reporting	Chaos correlation with SLA metrics	Executive reporting, ROI justification
Custom Integrations	Webhooks, API, CI/CD plugins	Workflow integration requirements

CI/CD Integration:

Gremlin integrates with deployment pipelines to provide automated resilience validation as part of the software delivery process.

.github/workflows/chaos-gate.yaml
# Example: Gremlin as a deployment gate in GitHub Actions
 
name: Deploy with Chaos Gate
on:
  push:
    branches: [main]
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to staging
        run: |
          kubectl apply -f k8s/staging/
          kubectl rollout status deployment/api-server -n staging
      
      - name: Wait for deployment stabilization
        run: sleep 60
      
      - name: Run Gremlin Reliability Test
        # Hypothetical action name and inputs, shown for illustration.
        # Action inputs must be scalar values, so lists and maps are flattened.
        uses: gremlin/actions/reliability-test@v1
        with:
          api-key: ${{ secrets.GREMLIN_API_KEY }}
          team-id: ${{ secrets.GREMLIN_TEAM_ID }}
          test-name: "post-deploy-validation"
          # Run pre-defined reliability tests
          tests: "redundancy-validation,graceful-shutdown,dependency-isolation"
          # Target the newly deployed service
          target-type: kubernetes
          target-namespace: staging
          target-deployment: api-server
          # Halt if error rate spikes
          halt-on-error-rate: "0.05"
          
      - name: Promote to production
        if: success()
        run: |
          echo "Chaos gate passed - promoting to production"
          kubectl apply -f k8s/production/
          
      - name: Alert on chaos failure
        if: failure()
        run: |
          echo "Chaos gate failed - blocking production deployment"
          # Send alert via Slack/PagerDuty/etc.
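
The gate decision in a pipeline like this reduces to a small rule: promotion proceeds only if every required test passed. The sketch below expresses that rule in TypeScript. `TestOutcome`, `gateExitCode`, and the status values are assumptions for illustration and do not describe the output of any real Gremlin integration.

```typescript
// Hypothetical test outcome shape; status values are assumptions.
type TestStatus = "passed" | "failed" | "halted";
interface TestOutcome {
  name: string;
  status: TestStatus;
}

// Returns a process-style exit code: 0 lets the pipeline continue,
// 1 blocks promotion. A required test that never ran counts as a failure.
function gateExitCode(outcomes: TestOutcome[], required: string[]): number {
  for (const name of required) {
    const passed = outcomes.some((o) => o.name === name && o.status === "passed");
    if (!passed) return 1;
  }
  return 0;
}

const requiredTests = ["redundancy-validation", "graceful-shutdown", "dependency-isolation"];

const cleanRun = gateExitCode(
  [
    { name: "redundancy-validation", status: "passed" },
    { name: "graceful-shutdown", status: "passed" },
    { name: "dependency-isolation", status: "passed" },
  ],
  requiredTests,
); // 0: promote

const haltedRun = gateExitCode(
  [
    { name: "redundancy-validation", status: "passed" },
    { name: "graceful-shutdown", status: "halted" },
    { name: "dependency-isolation", status: "passed" },
  ],
  requiredTests,
); // 1: a halted test blocks promotion

const incompleteRun = gateExitCode(
  [{ name: "redundancy-validation", status: "passed" }],
  requiredTests,
); // 1: missing results block promotion
```

Treating halted and missing tests as failures keeps the gate conservative: a release is promoted only on positive evidence, never on the absence of bad news.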

Shift-Left Chaos

Summary: Gremlin's Role in the Chaos Ecosystem

Gremlin democratized chaos engineering by transforming it from a practice requiring significant custom development into an accessible platform with enterprise-grade features.

Key Takeaways

•Gremlin solved the adoption problem — By providing a complete platform, it removed the need for organizations to build chaos tooling from scratch.
•The architecture separates control from execution — SaaS control plane provides management while agents execute locally, balancing convenience with security.
•Three attack categories cover most failure modes — Resource, network, and state attacks simulate the failures that matter most in distributed systems.
•Targeting and scope controls enable safe experimentation — Precise targeting, percentage limits, and avoid lists keep chaos bounded and useful.
•Scenarios enable complex failure simulation — Multi-step attacks simulate real-world failure cascades that single attacks cannot replicate.
•Safety mechanisms are central, not optional — Halt buttons, duration limits, automatic rollback, and RBAC make chaos safe for enterprise environments.
•Enterprise features unlock organizational adoption — SSO, audit logging, compliance reporting, and CI/CD integration address the non-technical barriers to chaos.

When to choose Gremlin:

You need chaos capabilities quickly, without building tooling
Compliance and audit requirements demand commercial solutions
You want a managed platform with professional support
Your organization spans multiple cloud providers and infrastructure types
You need RBAC and team management for multi-tenant chaos
