Netflix had thousands of engineers and years of experience to build their chaos engineering practice from scratch. But what about organizations without those resources? What about companies that want the benefits of chaos engineering without building custom tooling, maintaining infrastructure, and developing in-house expertise?
This is the problem Gremlin set out to solve.
Founded in 2016 by former Netflix and Amazon engineers—including Kolton Andrus, who led chaos engineering at Amazon—Gremlin created the first commercial chaos engineering platform. Their mission: make chaos engineering accessible, safe, and valuable for any organization, regardless of size or technical sophistication.
Gremlin transformed chaos engineering from a practice that only elite engineering organizations could implement into a discipline accessible to teams across the technology spectrum.
By the end of this page, you will understand Gremlin's architecture and operational model, master its attack types and targeting mechanisms, learn to design safe and effective chaos experiments, and evaluate when Gremlin is the right choice for your organization.
Before Gremlin, organizations attempting chaos engineering faced significant barriers. Understanding these barriers illuminates why Gremlin's approach was transformative.
| Barrier | Impact | What Organizations Did Instead |
|---|---|---|
| Custom tooling required | Months of engineering investment | Skipped chaos engineering entirely |
| Deep infrastructure knowledge needed | Only platform teams could participate | Limited to simple tests by experts |
| Safety mechanisms not obvious | Fear of causing production outages | Stuck to staging environments |
| No standardized attack types | Reinventing failure scenarios | Ad-hoc manual testing |
| Limited visibility into experiments | Difficult to correlate cause and effect | Blamed chaos for unrelated issues |
| No compliance/audit features | Enterprise approval blocked | Security teams vetoed chaos |
The Netflix paradox:
Netflix's success with Chaos Monkey created an aspirational goal but didn't provide a roadmap for organizations that weren't Netflix. Many teams tried to replicate their approach and discovered that:
Gremlin's founders recognized that chaos engineering needed to become a product, not a project—something organizations could adopt incrementally without building from scratch.
Chaos engineering's biggest obstacle wasn't technical—it was organizational. Teams weren't avoiding chaos because it was impossible; they were avoiding it because it was expensive, scary, and hard to justify. Gremlin attacked all three barriers simultaneously.
Gremlin's architecture reflects its enterprise mission: centralized control, distributed execution, comprehensive observability, and robust safety mechanisms. Understanding this architecture is essential for effective deployment and operation.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                          GREMLIN ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │                  GREMLIN CONTROL PLANE (SaaS)                  │    │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│   │  │   Web UI    │  │  REST API   │  │  Attack Orchestration   │ │    │
│   │  │  Dashboard  │  │   & CLI     │  │         Engine          │ │    │
│   │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │    │
│   │  │  Team Mgmt  │  │  Scenario   │  │       Analytics &       │ │    │
│   │  │   & RBAC    │  │   Library   │  │        Reporting        │ │    │
│   │  └─────────────┘  └─────────────┘  └─────────────────────────┘ │    │
│   └─────────────────────────────┬──────────────────────────────────┘    │
│                                 │                                       │
│                   Secure TLS + mTLS Connection                          │
│                                 │                                       │
│   ┌─────────────────────────────▼──────────────────────────────────┐    │
│   │                    CUSTOMER INFRASTRUCTURE                     │    │
│   │                                                                │    │
│   │  ┌──────────────────────────────────────────────────────────┐  │    │
│   │  │                      GREMLIN DAEMON                      │  │    │
│   │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │    │
│   │  │  │   Command   │  │   Attack    │  │    Telemetry    │   │  │    │
│   │  │  │  Receiver   │  │  Executor   │  │    Reporter     │   │  │    │
│   │  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │    │
│   │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐   │  │    │
│   │  │  │   Safety    │  │   Health    │  │      Halt       │   │  │    │
│   │  │  │  Validator  │  │   Monitor   │  │     Handler     │   │  │    │
│   │  │  └─────────────┘  └─────────────┘  └─────────────────┘   │  │    │
│   │  └──────────────────────────────────────────────────────────┘  │    │
│   │                                                                │    │
│   │  ┌──────────────────────────────────────────────────────────┐  │    │
│   │  │                      TARGET SYSTEMS                      │  │    │
│   │  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────┐  │  │    │
│   │  │  │   VM    │  │Container│  │ K8s Pod │  │   Lambda    │  │  │    │
│   │  │  │ Server  │  │ (Docker)│  │         │  │  Function   │  │  │    │
│   │  │  └─────────┘  └─────────┘  └─────────┘  └─────────────┘  │  │    │
│   │  └──────────────────────────────────────────────────────────┘  │    │
│   └────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
```

Agent deployment models:
Gremlin supports multiple deployment approaches to fit different infrastructure patterns:
| Environment | Deployment Method | Attack Scope |
|---|---|---|
| VMs (EC2, GCE, Azure VMs) | Package installation (apt, yum, homebrew) | Full system: CPU, memory, disk, network |
| Docker Containers | Sidecar container in same network namespace | Container-scoped resource attacks |
| Kubernetes | DaemonSet or sidecar injection | Pod, node, or namespace-scoped attacks |
| AWS Lambda | Lambda Layer | Function timeout and dependency failures |
| Serverless (general) | Application-embedded SDK | Application-level failures |
Gremlin agents are designed with minimal privilege. They only request the permissions needed for their configured attack types. An agent configured only for CPU attacks won't have network manipulation capabilities. This principle of least privilege limits blast radius if an agent is compromised.
Gremlin organizes its attacks into three primary categories, each targeting a different dimension of system behavior. Understanding these categories helps design comprehensive chaos experiments.
```bash
# Gremlin CLI Attack Examples
# Flags are illustrative and vary by agent version; check the CLI's built-in help before running.

# === RESOURCE ATTACKS ===

# CPU: Consume 80% of 2 cores for 60 seconds
gremlin attack cpu -l 60 -c 2 -p 80

# Memory: Consume 512MB of memory for 120 seconds (-m is megabytes, -g is gigabytes)
gremlin attack memory -l 120 -m 512

# Disk: Fill disk with 10GB over 60 seconds
gremlin attack disk -d /tmp -w 60 -b 10

# IO: Generate heavy disk I/O for 90 seconds
gremlin attack io -l 90 -d /var/data -c 4 -w io -s 1m

# === NETWORK ATTACKS ===

# Latency: Add 200ms delay to traffic to database server
gremlin attack latency -l 120 -m 200 -h db.internal.company.com -p 5432

# Packet Loss: 10% packet loss to external APIs
gremlin attack packet_loss -l 90 -r 10 -h api.external-service.com

# Blackhole: Block all traffic to payment service for 60 seconds
gremlin attack blackhole -l 60 -h payment-service.internal

# DNS: Block DNS resolution for 30 seconds
gremlin attack dns -l 30

# === STATE ATTACKS ===

# Process Kill: Terminate nginx process
gremlin attack process_killer -l 1 -p nginx -i

# Shutdown: Reboot the system in 60 seconds
gremlin attack shutdown -d 60 -r

# Time Travel: Move clock forward 2 hours
gremlin attack time_travel -l 120 -o 7200

# === TARGETED ATTACKS (Kubernetes) ===

# Attack a specific container
gremlin attack cpu -l 60 -c 1 -p 50 \
  -t container \
  --container-id abc123def

# Attack all pods in a namespace matching a label
gremlin attack latency -l 120 -m 100 \
  -t kubernetes \
  --namespace production \
  --labels app=api-gateway

# Attack a percentage of pods in a deployment
gremlin attack blackhole -l 60 \
  -t kubernetes \
  --deployment api-server \
  --percent 25
```

Be cautious when combining attacks. Simultaneous memory and CPU attacks can trigger the OOM killer unpredictably, and network latency combined with packet loss can degrade traffic more than either attack would alone. Start with single-dimension attacks before exploring combinations.
Gremlin's targeting system allows precise control over which systems experience chaos. This precision is essential for safe experimentation and meaningful learning.
| Dimension | Mechanism | Use Case |
|---|---|---|
| Infrastructure | Agent tags, cloud provider metadata | Target by region, availability zone, instance type |
| Container | Container ID, image name, labels | Target specific containers in shared hosts |
| Kubernetes | Namespace, deployment, pod labels, node selectors | Target within Kubernetes semantics |
| Service | Service mesh metadata, Consul/Eureka registration | Target by logical service, not infrastructure |
| Random | Percentage-based selection | Attack random subset of matching targets |
| Precise | Explicit target IDs | Attack exactly specified systems |
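
To make the label-based and random rows concrete, here is a minimal TypeScript sketch of how label matching and percentage-based selection might combine with hard caps on the number of targets. The `Target` and `BlastRadiusLimits` types and the `selectTargets` function are hypothetical names chosen for illustration and are not part of Gremlin's API. The random source is injectable so the selection can be made deterministic in tests.

```typescript
// Hypothetical types for illustration; not Gremlin's actual API.
interface Target {
  id: string;
  labels: Record<string, string>;
}

interface BlastRadiusLimits {
  maxTargets: number;      // absolute cap on attacked targets
  maxPercentage: number;   // cap as a percentage (0-100) of matching targets
  preserveMinimum: number; // matching targets that must stay untouched
}

// True when the target carries every label in the selector.
function matchesLabels(target: Target, selector: Record<string, string>): boolean {
  return Object.entries(selector).every(([key, value]) => target.labels[key] === value);
}

// Fisher-Yates shuffle on a copy, so the caller's array is not mutated.
function shuffle<T>(items: readonly T[], random: () => number): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = copy[i] as T;
    copy[i] = copy[j] as T;
    copy[j] = tmp;
  }
  return copy;
}

function selectTargets(
  all: readonly Target[],
  selector: Record<string, string>,
  limits: BlastRadiusLimits,
  random: () => number = Math.random,
): Target[] {
  const matching = all.filter((t) => matchesLabels(t, selector));
  const byPercentage = Math.floor((matching.length * limits.maxPercentage) / 100);
  const byMinimum = matching.length - limits.preserveMinimum;
  // The strictest limit wins; the count never drops below zero.
  const count = Math.max(0, Math.min(limits.maxTargets, byPercentage, byMinimum));
  return shuffle(matching, random).slice(0, count);
}

// Example: 8 gateway pods and 2 unrelated pods.
const pods: Target[] = [
  ...Array.from({ length: 8 }, (_, i) => ({ id: `gateway-${i}`, labels: { app: "api-gateway" } })),
  { id: "billing-0", labels: { app: "billing" } },
  { id: "billing-1", labels: { app: "billing" } },
];

const chosen = selectTargets(
  pods,
  { app: "api-gateway" },
  { maxTargets: 2, maxPercentage: 25, preserveMinimum: 1 },
);
console.log(chosen.map((t) => t.id));
```

With 8 matching pods, the 25% cap and the 2-target cap both resolve to 2, so `chosen` holds two gateway pods and never a billing pod. Taking the minimum across all caps means the strictest limit always decides the blast radius.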
Blast radius management:
Controlling blast radius—the scope of potential impact—is central to safe chaos engineering. Gremlin provides multiple mechanisms for limiting blast radius:
```yaml
# Example: Gremlin Scenario with Blast Radius Controls (illustrative schema)
name: "API Gateway Latency Test"
description: "Test API gateway behavior under database latency"
hypothesis: "API gateway circuit breakers trip at 500ms latency"

# Define targeting with explicit limits
targeting:
  type: kubernetes
  namespace: production
  labels:
    app: api-gateway
    tier: frontend

  # Blast radius controls
  limits:
    # Attack at most 2 pods
    max_targets: 2
    # OR attack at most 25% of matching pods
    max_percentage: 25
    # Skip targets labeled as critical
    avoid_labels:
      critical: "true"
    # Never attack last healthy instance
    preserve_minimum: 1

attack:
  type: latency
  parameters:
    delay_ms: 500
    jitter_ms: 50
    # Only affect database traffic
    target_hosts:
      - "postgres.internal"
      - "redis.internal"
    target_ports:
      - 5432
      - 6379

  # Duration and safety limits
  duration_seconds: 120

  # Auto-halt conditions
  halt_conditions:
    - metric: "error_rate"
      threshold: 0.05  # 5% error rate
      window_seconds: 30
    - metric: "p99_latency_ms"
      threshold: 5000  # 5 second latency
      window_seconds: 60

# Scheduling
schedule:
  # Only during business hours
  allowed_hours: "09:00-17:00"
  timezone: "America/Los_Angeles"
  days: ["monday", "tuesday", "wednesday", "thursday", "friday"]
  # Not during maintenance windows
  blackout_periods:
    - name: "Weekly deployment window"
      schedule: "tuesday 14:00-16:00"
    - name: "Monthly maintenance"
      schedule: "first saturday 02:00-06:00"
```

Gremlin introduces Scenarios and Reliability Tests: structured approaches to chaos that go beyond individual attacks to comprehensive resilience validation.
Scenarios: Multi-step chaos experiments
Scenarios chain multiple attacks together to simulate complex real-world failure cascades. A single attack tests one failure mode; a scenario tests how failures compound.
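
The control flow a scenario implies can be sketched in a few lines of TypeScript: run each step in order, check its success criteria against observed metrics, and stop before escalating if a step fails. Everything here (`Step`, `Criterion`, `runScenario`, and the canned metric values) is a hypothetical illustration rather than Gremlin's scenario engine, and stopping at the first failed step is a design assumption of this sketch.

```typescript
// Hypothetical shapes for illustration; not Gremlin's scenario format.
type Metrics = Record<string, number>;

interface Criterion {
  description: string;
  metric: string;
  passes: (value: number) => boolean;
}

interface Step {
  name: string;
  // Runs the attack (or nothing, for a recovery phase) and returns observed metrics.
  run: () => Metrics;
  successCriteria: Criterion[];
}

interface StepResult {
  step: string;
  failed: string[]; // descriptions of criteria that did not pass
}

// Runs steps in order and stops after the first step with a failed criterion,
// so a harsher phase never runs against a system that has already failed.
function runScenario(steps: readonly Step[]): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const observed = step.run();
    const failed = step.successCriteria
      .filter((c) => {
        const value = observed[c.metric];
        // A missing metric counts as a failure: no evidence is not success.
        return value === undefined || !c.passes(value);
      })
      .map((c) => c.description);
    results.push({ step: step.name, failed });
    if (failed.length > 0) break;
  }
  return results;
}

// Canned metrics stand in for a real monitoring query.
const results = runScenario([
  {
    name: "Phase 1: Increased Latency",
    run: () => ({ app_error_rate: 0.004 }),
    successCriteria: [
      { description: "Error rate below 1%", metric: "app_error_rate", passes: (v) => v < 0.01 },
    ],
  },
  {
    name: "Phase 2: Severe Latency",
    run: () => ({ app_error_rate: 0.08 }),
    successCriteria: [
      { description: "Error rate below 5%", metric: "app_error_rate", passes: (v) => v < 0.05 },
    ],
  },
  {
    name: "Phase 3: Complete Failure",
    run: () => ({ user_visible_errors: 0.02 }),
    successCriteria: [
      { description: "User-visible errors below 10%", metric: "user_visible_errors", passes: (v) => v < 0.1 },
    ],
  },
]);
console.log(results);
```

In this run, Phase 1 passes, Phase 2 fails its error-rate criterion, and Phase 3 never executes. That ordering is the point of chaining: each phase only escalates once the system has shown it can absorb the previous one.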
```yaml
# Example: Database Failure Cascade Scenario
# Simulates progressive database degradation leading to failure
name: "Progressive Database Failure"
description: "Simulates database performance degradation before complete failure"
team: "platform-engineering"
tags: ["database", "cascade", "monthly-test"]

steps:
  - name: "Phase 1: Increased Latency"
    description: "Database response time increases"
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 100
        duration_seconds: 180
    success_criteria:
      - description: "Application error rate stays below 1%"
        metric: "app_error_rate"
        condition: "< 0.01"
      - description: "Circuit breakers remain closed"
        metric: "circuit_breaker_state"
        condition: "== closed"

  - name: "Phase 2: Severe Latency"
    description: "Database becomes very slow"
    wait_after_previous: 30  # seconds
    attack:
      type: latency
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        delay_ms: 500
        duration_seconds: 120
    success_criteria:
      - description: "Circuit breakers open"
        metric: "circuit_breaker_state"
        condition: "== open"
      - description: "Fallback cache is serving requests"
        metric: "cache_fallback_rate"
        condition: "> 0.9"
      - description: "Error rate stays acceptable"
        metric: "app_error_rate"
        condition: "< 0.05"

  - name: "Phase 3: Complete Failure"
    description: "Database becomes unreachable"
    wait_after_previous: 30
    attack:
      type: blackhole
      targets:
        type: service
        name: "postgres-primary"
      parameters:
        duration_seconds: 120
    success_criteria:
      - description: "Application remains functional via fallback"
        metric: "health_check_status"
        condition: "== healthy"
      - description: "User-facing error rate acceptable"
        metric: "user_visible_errors"
        condition: "< 0.10"
      - description: "Alerting triggered within 30 seconds"
        metric: "time_to_alert"
        condition: "< 30"

  - name: "Phase 4: Recovery"
    description: "Database becomes available again"
    wait_after_previous: 0  # Previous attack ends, this documents recovery
    attack:
      type: none  # No attack - observing recovery
      duration_seconds: 180
    success_criteria:
      - description: "Circuit breakers close within 60 seconds"
        metric: "time_to_circuit_close"
        condition: "< 60"
      - description: "Database connections re-established"
        metric: "db_connection_count"
        condition: "> 0"
      - description: "Error rate returns to baseline"
        metric: "app_error_rate"
        condition: "< 0.001"

overall_success_criteria:
  - "No data loss during any phase"
  - "Maximum user-visible error rate across all phases < 10%"
  - "Complete recovery within 5 minutes of database restoration"
```

Reliability Tests: Pre-built chaos experiments
Gremlin's Reliability Tests provide ready-made experiments for common failure scenarios. These tests encode industry best practices and make it easy to validate common resilience patterns.
| Test Category | What It Validates | Common Findings |
|---|---|---|
| Host Reliability | Can services survive instance failures? | Missing auto-scaling, inadequate health checks |
| Service Discovery | Can services find dependencies after failure? | Hardcoded endpoints, stale cache entries |
| Graceful Shutdown | Do services drain connections properly? | Immediate termination, connection errors during deploy |
| Redundancy Validation | Is redundancy actually providing protection? | Single-AZ deployments, shared dependencies |
| Dependency Isolation | Does dependency failure cascade? | Missing circuit breakers, timeout issues |
| Data Persistence | Is data safe during infrastructure failures? | Missing replication, improper fsync usage |
For teams new to chaos engineering, Gremlin's pre-built Reliability Tests provide a safe starting point. They validate fundamental resilience properties without requiring deep expertise in failure mode design. Graduate to custom Scenarios once you've validated basics.
Gremlin's safety mechanisms differentiate professional chaos engineering from simply breaking things. These mechanisms ensure that chaos remains controlled, observable, and reversible.
```typescript
// Example: Gremlin Safety Configuration via API

interface SafetyConfiguration {
  // Global policies
  global: {
    // Maximum attack duration (prevents indefinite chaos)
    maxDurationSeconds: number;
    // Require approval for production attacks
    requireProductionApproval: boolean;
    // Auto-halt if these conditions are met
    globalHaltConditions: HaltCondition[];
  };

  // Team-level overrides
  teams: {
    [teamId: string]: TeamSafetyConfig;
  };

  // Target-level protections
  targetProtections: TargetProtection[];
}

interface HaltCondition {
  metric: string;
  threshold: number;
  comparison: 'gt' | 'lt' | 'eq' | 'gte' | 'lte';
  windowSeconds: number;
}

// Per-team limits; fields match those used in the example configuration below
interface TeamSafetyConfig {
  maxDurationSeconds: number;
  requireApprovalAbovePercentage: number;
  allowedEnvironments: string[];
  restrictedToOwnServices?: boolean;
}

interface TargetProtection {
  // What to protect
  selector: {
    tags?: Record<string, string>;
    labels?: Record<string, string>;
    services?: string[];
    namespaces?: string[];
  };

  // Protection level
  protection: 'blocked' | 'approval-required' | 'limited';

  // If limited, what's allowed
  allowedAttacks?: string[];
  maxDurationSeconds?: number;
  allowedHours?: string; // cron format
}

// Example configuration
const safetyConfig: SafetyConfiguration = {
  global: {
    maxDurationSeconds: 3600, // 1 hour max
    requireProductionApproval: true,
    globalHaltConditions: [
      {
        metric: "critical_alerts_firing",
        threshold: 1,
        comparison: "gte",
        windowSeconds: 60
      },
      {
        metric: "deployment_in_progress",
        threshold: 1,
        comparison: "eq",
        windowSeconds: 0
      }
    ]
  },

  teams: {
    "platform-engineering": {
      // Platform team has broader access
      maxDurationSeconds: 7200,
      requireApprovalAbovePercentage: 50,
      allowedEnvironments: ["staging", "production"]
    },
    "product-team-a": {
      // Product teams more restricted
      maxDurationSeconds: 1800,
      requireApprovalAbovePercentage: 25,
      allowedEnvironments: ["staging", "production"],
      restrictedToOwnServices: true
    }
  },

  targetProtections: [
    {
      // Never chaos the payment system without approval
      selector: {
        services: ["payment-service", "billing-service"],
        labels: { "tier": "critical" }
      },
      protection: "approval-required"
    },
    {
      // Authentication can only receive limited attacks
      selector: {
        services: ["auth-service", "identity-provider"]
      },
      protection: "limited",
      allowedAttacks: ["latency", "cpu"],
      maxDurationSeconds: 300,
      allowedHours: "0 9-14 * * 1-4" // Monday-Thursday, 9AM through 2PM
    },
    {
      // Database chaos completely blocked via self-serve
      selector: {
        labels: { "component": "database" }
      },
      protection: "blocked"
    }
  ]
};
```

Chaos engineering that isn't safe isn't chaos engineering; it's just breaking things. Every mature chaos practice invests heavily in safety mechanisms. If you find yourself thinking 'we can skip this safety check,' you're not ready for that experiment.
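
As a rough illustration of how a halt condition shaped like `HaltCondition` could be evaluated, the sketch below averages a metric's samples over the condition's window and compares the result to the threshold. The `Sample` type, the `shouldHalt` function, and the choice to average (rather than, say, require every sample to breach) are assumptions made for this example and do not describe Gremlin's internals. A window of 0 seconds is treated here as "use the latest sample".

```typescript
// Same shape as HaltCondition in the safety configuration example.
interface HaltCondition {
  metric: string;
  threshold: number;
  comparison: "gt" | "lt" | "eq" | "gte" | "lte";
  windowSeconds: number;
}

// Hypothetical metric sample: a value observed at a point in time.
interface Sample {
  timestampMs: number;
  value: number;
}

function compare(value: number, comparison: HaltCondition["comparison"], threshold: number): boolean {
  switch (comparison) {
    case "gt": return value > threshold;
    case "lt": return value < threshold;
    case "eq": return value === threshold;
    case "gte": return value >= threshold;
    case "lte": return value <= threshold;
  }
}

function shouldHalt(condition: HaltCondition, samples: readonly Sample[], nowMs: number): boolean {
  const past = samples.filter((s) => s.timestampMs <= nowMs);
  if (past.length === 0) return false; // no data: this sketch does not halt

  let considered: Sample[];
  if (condition.windowSeconds === 0) {
    // Zero-length window: evaluate only the most recent sample.
    considered = [past.reduce((latest, s) => (s.timestampMs > latest.timestampMs ? s : latest))];
  } else {
    const windowStart = nowMs - condition.windowSeconds * 1000;
    considered = past.filter((s) => s.timestampMs >= windowStart);
  }
  if (considered.length === 0) return false;

  const mean = considered.reduce((sum, s) => sum + s.value, 0) / considered.length;
  return compare(mean, condition.comparison, condition.threshold);
}

const now = 100_000;
const errorRate: HaltCondition = {
  metric: "error_rate",
  threshold: 0.05,
  comparison: "gt",
  windowSeconds: 30,
};
const samples: Sample[] = [
  { timestampMs: now - 40_000, value: 0.2 },  // outside the 30 s window, ignored
  { timestampMs: now - 20_000, value: 0.04 },
  { timestampMs: now - 10_000, value: 0.08 },
];

// The two in-window samples average roughly 0.06, which exceeds the 0.05 threshold.
const halt = shouldHalt(errorRate, samples, now);
console.log(halt);
```

A production implementation would also need to decide what missing data means. This sketch treats it as "do not halt", but a stricter policy could reasonably halt when telemetry disappears mid-experiment.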
Gremlin's enterprise features address the organizational and compliance requirements that often block chaos adoption in larger organizations.
| Feature | Capability | Enterprise Need Addressed |
|---|---|---|
| SSO Integration | SAML 2.0, OIDC, Active Directory | Central identity management requirement |
| RBAC | Fine-grained permission controls | Least-privilege access, audit compliance |
| Audit Logging | Immutable logs of all chaos activity | SOC 2, regulatory compliance |
| Team Management | Isolated team environments | Multi-tenant organizations |
| Private Agents | On-premise control plane option | Data sovereignty, airgapped environments |
| SLA Reporting | Chaos correlation with SLA metrics | Executive reporting, ROI justification |
| Custom Integrations | Webhooks, API, CI/CD plugins | Workflow integration requirements |
CI/CD Integration:
Gremlin integrates with deployment pipelines to provide automated resilience validation as part of the software delivery process.
```yaml
# Example: Gremlin as a deployment gate in GitHub Actions
name: Deploy with Chaos Gate
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Deploy to staging
        run: |
          kubectl apply -f k8s/staging/
          kubectl rollout status deployment/api-server -n staging

      - name: Wait for deployment stabilization
        run: sleep 60

      - name: Run Gremlin Reliability Test
        # Illustrative action reference and input names; substitute your actual Gremlin integration
        uses: gremlin/actions/reliability-test@v1
        with:
          api-key: ${{ secrets.GREMLIN_API_KEY }}
          team-id: ${{ secrets.GREMLIN_TEAM_ID }}
          test-name: "post-deploy-validation"
          # Run pre-defined reliability tests (action inputs are scalars, so the list is one string)
          tests: "redundancy-validation,graceful-shutdown,dependency-isolation"
          # Target the newly deployed service
          target-type: kubernetes
          target-namespace: staging
          target-deployment: api-server
          # Halt if error rate spikes
          halt-on-error-rate: 0.05

      - name: Promote to production
        if: success()
        run: |
          echo "Chaos gate passed - promoting to production"
          kubectl apply -f k8s/production/

      - name: Alert on chaos failure
        if: failure()
        run: |
          echo "Chaos gate failed - blocking production deployment"
          # Send alert via Slack/PagerDuty/etc.
```

By integrating Gremlin into CI/CD pipelines, organizations shift chaos testing left, catching resilience issues before production rather than discovering them during incidents. This reduces incident frequency and increases deployment confidence.
Gremlin democratized chaos engineering by transforming it from a practice requiring significant custom development into an accessible platform with enterprise-grade features.
When to choose Gremlin:
What's next:
Gremlin represents the commercial, enterprise approach to chaos. But what about cloud-native, open-source alternatives? On the next page, we'll explore LitmusChaos, a Kubernetes-native chaos engineering framework that brings chaos to the container orchestration layer.
You now understand Gremlin's architecture, attack types, safety mechanisms, and enterprise features. Gremlin transformed chaos engineering from an elite practice into an accessible discipline, making resilience testing achievable for organizations of all sizes and maturities.