In 2010, Netflix made a decision that seemed counterintuitive—perhaps even reckless—to the rest of the technology industry. They created a tool designed to randomly terminate virtual machines in their production environment. Not in staging. Not in testing. In production, where millions of subscribers were actively streaming content.
They called it Chaos Monkey.
The name was deliberate. Imagine a wild monkey loose in your data center, randomly unplugging servers, yanking cables, and causing mayhem. That's exactly what Chaos Monkey simulates—random infrastructure failures that force systems to prove they can survive chaos.
What seemed like madness was actually profound engineering wisdom. By 2011, when Amazon Web Services experienced a major outage in the US-East region that brought down Reddit, Quora, Foursquare, and countless other services, Netflix—running entirely on AWS—remained operational. Their secret? They had been training for exactly this scenario, every single day, courtesy of their mischievous monkey.
By the end of this page, you will understand Chaos Monkey's architecture, operational mechanics, and the design principles that made it revolutionary. You'll learn how to implement similar chaos at your organization, and understand why randomly killing production instances is actually safer than not doing so.
To understand Chaos Monkey, we must first understand the context from which it emerged. In 2008, Netflix began one of the most ambitious infrastructure migrations in technology history: moving from their own data centers to Amazon Web Services.
This wasn't a minor technical adjustment—it was a fundamental reconceptualization of how Netflix operated. The company was transitioning from vertically scaled, carefully maintained physical servers to horizontally scaled, disposable virtual machines.
| Traditional Data Center | AWS (Cloud) | Implications |
|---|---|---|
| Physical servers, high capital cost | Virtual machines, pay-per-use | Individual instances become disposable |
| Hardware failures rare but catastrophic | Instance failures common but isolated | Design for failure becomes mandatory |
| Vertical scaling (bigger machines) | Horizontal scaling (more machines) | State must be externalized |
| Long-lived servers (years) | Ephemeral instances (hours to days) | Cannot rely on instance persistence |
| Manual maintenance windows | Automatic scaling events | Systems must handle dynamic capacity |
The fundamental insight:
Netflix's engineering leadership recognized a profound truth: in a cloud environment, failure isn't an exceptional event—it's a constant condition. Virtual machines terminate unexpectedly. Network partitions occur. Services degrade. Availability zones experience issues.
The question wasn't if these failures would happen, but when—and whether their systems would survive them.
Traditional approaches to reliability focused on preventing failures: redundant power supplies, RAID arrays, enterprise-grade hardware. But in the cloud, you can't prevent AWS from terminating your instance. You can only design systems that don't care when it happens.
Chaos Monkey emerged from a critical realization: the only way to know whether your system survives failure is to actually fail it. Testing resilience in staging environments provides false confidence, because staging never perfectly mirrors production's complexity, scale, and real-world conditions.
The birth of intentional chaos:
Greg Orzell, Cory Bennett, and the Netflix engineering team created Chaos Monkey in 2010 as a direct response to this new operational reality. The core idea was elegantly simple: during business hours, randomly terminate production instances and require every service to keep serving traffic anyway.
This created a continuous resilience verification loop—a system that constantly tested Netflix's ability to survive exactly the kind of failures they knew would occur in production.
Chaos Monkey's architecture reflects its purpose: controlled, observable, and safely bounded destruction. Understanding this architecture is essential for implementing similar chaos capabilities in any organization.
Core Components:
Chaos Monkey integrates with Netflix's cloud infrastructure through several key components, each playing a specific role in the chaos workflow.
```
┌──────────────────────────────────────────────────────────────────────┐
│                      CHAOS MONKEY ARCHITECTURE                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌──────────────┐     ┌─────────────────┐     ┌─────────────────┐     │
│  │  Scheduler   │────▶│  Chaos Monkey   │────▶│  AWS/Spinnaker  │     │
│  │   (Cron)     │     │   Core Engine   │     │       API       │     │
│  └──────────────┘     └────────┬────────┘     └─────────────────┘     │
│                                │                                      │
│                                ▼                                      │
│                     ┌─────────────────────┐                           │
│                     │   Discovery Layer   │                           │
│                     │   (Eureka/Consul)   │                           │
│                     └──────────┬──────────┘                           │
│                                │                                      │
│          ┌─────────────────────┼─────────────────────┐                │
│          ▼                     ▼                     ▼                │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐         │
│   │  App Group  │       │  App Group  │       │  App Group  │         │
│   │   (ASG-A)   │       │   (ASG-B)   │       │   (ASG-C)   │         │
│   ├─────────────┤       ├─────────────┤       ├─────────────┤         │
│   │ Instance 1  │       │ Instance 1  │       │ Instance 1  │         │
│   │ Instance 2  │       │ Instance 2  │       │ Instance 2  │         │
│   │ Instance 3  │       └─────────────┘       └─────────────┘         │
│   │ Instance 4  │◀── TERMINATION (Random Selection)                   │
│   └─────────────┘                                                     │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │                       SAFETY CONTROLS                        │     │
│  │  • Time Windows (Business hours only)                        │     │
│  │  • Opt-out Groups (Critical services)                        │     │
│  │  • Kill Limits (Max terminations per run)                    │     │
│  │  • Deployment Awareness (Skip during deploys)                │     │
│  └──────────────────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────┘
```

The Selection Algorithm:
Chaos Monkey doesn't just pick any random instance. Its selection algorithm is carefully designed to provide useful signal while minimizing blast radius.
```
// Simplified Chaos Monkey Selection Algorithm

function selectVictim():
    // Step 1: Get all eligible applications
    applications = discoveryService.getAllApplications()

    // Step 2: Filter out opted-out applications
    eligible_apps = applications.filter(app =>
        !app.hasChaosMonkeyOptOut() &&
        !app.isCurrentlyDeploying() &&
        app.instanceCount() > app.minimumInstanceCount()
    )

    // Step 3: Randomly select an application
    selected_app = random.choice(eligible_apps)

    // Step 4: Get instances for selected application
    instances = selected_app.getHealthyInstances()

    // Step 5: Filter by availability zone balance
    // Avoid terminating if it would create AZ imbalance
    balanced_instances = instances.filter(instance =>
        !wouldCauseAZImbalance(selected_app, instance)
    )

    // Step 6: Apply additional safety filters
    safe_instances = balanced_instances.filter(instance =>
        !instance.hasActiveConnections(threshold: HIGH) &&
        !instance.wasRecentlyLaunched(minutes: 10) &&
        !instance.isLastHealthyInstance()
    )

    // Step 7: Final random selection
    if safe_instances.isEmpty():
        return null  // No safe target this run

    return random.choice(safe_instances)

function executeTermination(victim):
    // Log intent before action
    auditLog.record({
        action: "TERMINATION_INITIATED",
        instance: victim.id,
        application: victim.app,
        timestamp: now(),
        reason: "Chaos Monkey scheduled run"
    })

    // Execute termination via cloud API
    cloudProvider.terminateInstance(victim.id)

    // Log completion
    auditLog.record({
        action: "TERMINATION_COMPLETED",
        instance: victim.id,
        timestamp: now()
    })

    // Notify monitoring systems
    monitoring.recordChaosEvent(victim)
```

Notice how the algorithm includes multiple safety checks. Chaos Monkey will skip runs entirely if no safe targets exist. This is a crucial principle: chaos engineering tools should never make systems less safe than they would be without the tool.
Understanding how Chaos Monkey operates day-to-day is essential for anyone considering similar chaos practices. The operational model balances aggression with safety—constantly testing resilience while preventing unnecessary outages.
| Parameter | Netflix Default | Purpose |
|---|---|---|
| Run Frequency | Once per business day | Frequent enough to catch regressions, rare enough to be manageable |
| Time Window | 9 AM - 3 PM local | Engineers available to respond; avoids peak streaming hours |
| Kill Probability | 1.0 (always kill something) | Ensures continuous validation; can be reduced for sensitive apps |
| Max Kills Per Run | 1 per app group | Limits blast radius; prevents cascade of terminations |
| Min Instances Required | 2+ | Never terminates the last healthy instance |
| Cool-down Period | 1 hour minimum | Prevents rapid successive terminations of same app |
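To make these controls concrete, here is a minimal sketch, in TypeScript with hypothetical types, of how the time window, minimum-instance, and cool-down parameters from the table above might gate a termination. It illustrates the checks, not Netflix's actual implementation.

```typescript
// Sketch: gating a chaos run on the operational parameters above.
// The AppGroup shape and lastTerminationAt tracking are hypothetical.

interface AppGroup {
  name: string;
  healthyInstanceCount: number;
  minInstancesRequired: number;   // e.g. 2+
  lastTerminationAt?: Date;
  optedOut: boolean;
}

interface ChaosWindow {
  startHour: number;       // e.g. 9
  endHour: number;         // e.g. 15
  cooldownMinutes: number; // e.g. 60
}

function canTerminate(group: AppGroup, window: ChaosWindow, now: Date = new Date()): boolean {
  // 1. Respect the time window: engineers must be around to respond.
  const hour = now.getHours();
  if (hour < window.startHour || hour >= window.endHour) return false;

  // 2. Respect opt-outs and the minimum-instance floor.
  if (group.optedOut) return false;
  if (group.healthyInstanceCount <= group.minInstancesRequired) return false;

  // 3. Respect the cool-down: no rapid successive kills of the same group.
  if (group.lastTerminationAt) {
    const minutesSince = (now.getTime() - group.lastTerminationAt.getTime()) / 60000;
    if (minutesSince < window.cooldownMinutes) return false;
  }

  return true;
}
```

If any gate fails, the run simply skips that group; skipping is always an acceptable outcome.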
The Daily Chaos Cycle:
Chaos Monkey operates in a predictable rhythm that teams learn to expect and prepare for.
Chaos Monkey assumes that Auto Scaling Groups and load balancers are functioning correctly. If ASG replacement or load balancer health checks are misconfigured, Chaos Monkey will expose this—sometimes dramatically. This is a feature, not a bug: better to discover ASG issues during a controlled chaos event than during an actual incident.
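One way to check those assumptions before opting a service in, rather than letting Chaos Monkey discover them the hard way, is to inspect the Auto Scaling Group directly. A rough sketch using the AWS SDK for JavaScript v3; the thresholds (at least two instances, ELB health checks) mirror the safety defaults discussed above and are otherwise assumptions.

```typescript
import {
  AutoScalingClient,
  DescribeAutoScalingGroupsCommand,
} from "@aws-sdk/client-auto-scaling";

// Sketch: sanity-check an ASG before enabling chaos against it.
async function asgReadyForChaos(asgName: string): Promise<boolean> {
  const client = new AutoScalingClient({});
  const { AutoScalingGroups } = await client.send(
    new DescribeAutoScalingGroupsCommand({ AutoScalingGroupNames: [asgName] })
  );

  const asg = AutoScalingGroups?.[0];
  if (!asg) return false;

  // Redundancy: never rely on a single instance surviving.
  const hasRedundancy = (asg.MinSize ?? 0) >= 2 && (asg.DesiredCapacity ?? 0) >= 2;

  // ELB health checks mean the group replaces instances that stop serving
  // traffic, not just instances whose EC2 status checks fail.
  const usesElbHealthChecks = asg.HealthCheckType === "ELB";

  return hasRedundancy && usesElbHealthChecks;
}
```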
Integration with Deployment Pipeline:
At Netflix, Chaos Monkey integrates with Spinnaker (their deployment platform) to avoid chaos during sensitive periods.
```typescript
// Example: Chaos Monkey deployment awareness

interface DeploymentStatus {
  application: string;
  stage: 'ROLLING' | 'CANARY' | 'COMPLETE' | 'FAILED';
  startTime: Date;
  instances: {
    new: string[];
    old: string[];
  };
}

class ChaosMonkeyDeploymentGuard {
  private deploymentTracker: SpinnakerClient;

  async shouldAllowChaos(application: string): Promise<boolean> {
    const deployment = await this.deploymentTracker.getCurrentDeployment(application);

    if (!deployment) {
      return true; // No active deployment, chaos allowed
    }

    // Never chaos during active rollout
    if (deployment.stage === 'ROLLING') {
      console.log(`Skipping chaos for ${application}: deployment in progress`);
      return false;
    }

    // Never chaos during canary analysis
    if (deployment.stage === 'CANARY') {
      console.log(`Skipping chaos for ${application}: canary in progress`);
      return false;
    }

    // Allow chaos after deployment completes, but only after stabilization period
    if (deployment.stage === 'COMPLETE') {
      const stabilizationPeriod = 30 * 60 * 1000; // 30 minutes
      const timeSinceComplete = Date.now() - deployment.startTime.getTime();

      if (timeSinceComplete < stabilizationPeriod) {
        console.log(`Skipping chaos for ${application}: stabilization period`);
        return false;
      }
    }

    return true;
  }

  // When Chaos Monkey considers terminating a specific instance
  async shouldTerminateInstance(instanceId: string): Promise<boolean> {
    const deployment = await this.deploymentTracker.findDeploymentWithInstance(instanceId);

    if (!deployment) {
      return true;
    }

    // Only terminate old instances during rollout, not new ones
    // This lets us test if the old version survives reduced capacity
    if (deployment.instances.new.includes(instanceId)) {
      console.log(`Protecting new instance ${instanceId} during deployment`);
      return false;
    }

    return true;
  }
}
```

Chaos Monkey was just the beginning. Its success led Netflix to create an entire Simian Army—a collection of chaos tools, each testing a different dimension of resilience. Understanding this family of tools reveals the full scope of chaos engineering.
| Simian | What It Tests | Failure Mode Simulated |
|---|---|---|
| Chaos Monkey | Instance resilience | Random instance termination |
| Chaos Kong | Regional resilience | Entire AWS region evacuation |
| Chaos Gorilla | Availability Zone resilience | Full AZ failure |
| Latency Monkey | Latency tolerance | Artificial delay injection |
| Doctor Monkey | Health check accuracy | Instance health degradation |
| Janitor Monkey | Resource hygiene | Cleanup of unused resources |
| Conformity Monkey | Configuration compliance | Detection of non-conforming instances |
| Security Monkey | Security posture | Detection of security vulnerabilities |
| 10-18 Monkey | Internationalization | Locale-specific failure detection |
The Simian Army demonstrates a key principle: start simple and progressively increase chaos scope. Netflix didn't start with Chaos Kong (region-level failures). They built confidence with Chaos Monkey first, then expanded to larger blast radii as their systems—and their organizational maturity—proved ready.
Chaos Kong: The Ultimate Test
Chaos Kong represents the pinnacle of Netflix's chaos engineering. It simulates the complete failure of an entire AWS region—a scenario that has actually occurred in the real world (US-East-1 outages in 2011, 2015, 2017).
When Chaos Kong runs, it doesn't actually destroy the region. Instead, it:
- Shifts customer traffic away from the target region to the remaining regions
- Scales up capacity in the surviving regions so they can absorb the redirected load
- Verifies that customers keep streaming while the target region serves no traffic
- Shifts traffic back once the exercise is complete
Chaos Kong runs are carefully scheduled events with engineering leadership oversight—they're closer to GameDays than to autonomous chaos like Chaos Monkey.
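Netflix's real region evacuations are driven by its internal traffic-steering systems, but the underlying move (shift user traffic away from the "failed" region and let the surviving regions absorb it) can be sketched with something as simple as weighted DNS records. A hypothetical illustration using Route 53; the hosted zone, record names, and weights are placeholders.

```typescript
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

// Hypothetical sketch: "evacuate" a region by setting its weighted DNS
// record to 0 so new traffic flows to the surviving regions.
// HOSTED_ZONE_ID and the record names are placeholders.
async function setRegionWeight(regionId: string, weight: number): Promise<void> {
  const client = new Route53Client({});
  await client.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: "HOSTED_ZONE_ID",
      ChangeBatch: {
        Comment: `Chaos Kong drill: set ${regionId} weight to ${weight}`,
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "api.example.com",
              Type: "CNAME",
              SetIdentifier: regionId,  // one weighted record per region
              Weight: weight,           // 0 drains this region
              TTL: 60,
              ResourceRecords: [{ Value: `api.${regionId}.example.com` }],
            },
          },
        ],
      },
    })
  );
}

// Drill: drain us-east-1 and let the other regions absorb the traffic.
// await setRegionWeight("us-east-1", 0);
```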
Why instance termination isn't enough:
Sophisticated systems can survive instance termination while remaining vulnerable to more subtle failures:
- Dependencies that become slow rather than dead (latency, not outright outages)
- Instances that pass health checks while quietly misbehaving
- The loss of an entire Availability Zone or region
- Configuration drift, unused resources, and security gaps that accumulate silently
The Simian Army addresses this by providing chaos tools for each failure mode that matters in distributed systems.
Chaos Monkey is open source and available for organizations to adopt. However, deploying it requires more than just running the software—it demands organizational readiness and infrastructure prerequisites.
```yaml
# Example Chaos Monkey Configuration
# Based on Spinnaker/Chaos Monkey integration

chaosMonkey:
  # Global enable/disable
  enabled: true

  # Schedule configuration
  schedule:
    # When chaos can run
    timezone: "America/Los_Angeles"
    startHour: 9
    endHour: 15
    daysOfWeek:
      - MONDAY
      - TUESDAY
      - WEDNESDAY
      - THURSDAY
      - FRIDAY

    # How often to run
    frequency: HOURLY  # Options: HOURLY, DAILY, WEEKLY

  # Safety controls
  safety:
    # Minimum instances before chaos can terminate
    minInstancesPerASG: 2

    # Maximum percentage of instances to terminate per group
    maxTerminationPercentage: 50

    # Cooldown after termination before next is allowed
    cooldownMinutes: 60

    # Opt-out mechanism
    optOutTagKey: "chaosmonkey.optout"
    optOutTagValue: "true"

    # Always protect these application patterns
    protectedApplicationPatterns:
      - ".*-critical$"
      - "^auth-.*"
      - "^payment-.*"

  # Cloud provider configuration
  aws:
    region: "us-west-2"
    # Use IAM role or explicit credentials
    useInstanceProfile: true

  # Notification configuration
  notifications:
    slack:
      enabled: true
      channel: "#chaos-events"
      webhookUrl: "${SLACK_WEBHOOK_URL}"
    email:
      enabled: true
      recipients:
        - "sre-team@company.com"
      onlyOnError: true

  # Audit logging
  audit:
    destination: "cloudwatch"
    logGroup: "/chaos-monkey/audit"
    retentionDays: 90
```

Phased Rollout Strategy:
No organization should enable Chaos Monkey globally on day one. A phased approach builds confidence while limiting risk.
| Phase | Duration | Scope | Success Criteria |
|---|---|---|---|
| 1: Internal Only | 2-4 weeks | Non-production environments only | Teams comfortable with chaos concept |
| 2: Volunteer Services | 4-8 weeks | Services whose teams opt-in | Multiple services survive chaos uneventfully |
| 3: Expanded Coverage | 8-16 weeks | Default opt-in, with opt-out available | Most services survive; issues rare and quickly resolved |
| 4: Mandatory (Non-Critical) | Ongoing | All non-critical services, no opt-out | Chaos events are routine, non-incidents |
| 5: Full Coverage | Ongoing | All services including critical paths | Organization has proven chaos resilience |
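In phases 2 and 3, the opt-in/opt-out boundary is typically enforced with resource tags, like the `chaosmonkey.optout` tag in the configuration example above. Here is a short sketch of how a chaos runner might honor that tag before selecting a victim; the tag key and value mirror the earlier config, and the rest is an assumption.

```typescript
import { EC2Client, DescribeTagsCommand } from "@aws-sdk/client-ec2";

// Sketch: skip instances whose owners have opted out via a resource tag.
// The tag key/value mirror the optOutTagKey/optOutTagValue settings above.
const OPT_OUT_KEY = "chaosmonkey.optout";
const OPT_OUT_VALUE = "true";

async function isOptedOut(instanceId: string): Promise<boolean> {
  const client = new EC2Client({});
  const { Tags } = await client.send(
    new DescribeTagsCommand({
      Filters: [
        { Name: "resource-id", Values: [instanceId] },
        { Name: "key", Values: [OPT_OUT_KEY] },
      ],
    })
  );
  return (Tags ?? []).some((tag) => tag.Value === OPT_OUT_VALUE);
}
```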
Common rollout mistakes:
- Mistake #1: Enabling chaos before services have redundancy.
- Mistake #2: Running chaos without observability to understand impact.
- Mistake #3: Not communicating chaos schedules to affected teams.
- Mistake #4: Treating first chaos failures as emergencies rather than learning opportunities.
Chaos without observation is just destruction. The value of Chaos Monkey comes from what you learn from each termination. This requires robust metrics and observability practices.
```typescript
// Example: Chaos Monkey metrics collection and dashboard integration

interface ChaosEvent {
  eventId: string;
  timestamp: Date;
  application: string;
  instanceId: string;
  region: string;
  availabilityZone: string;
}

interface ChaosMetrics {
  // Time measurements
  timeToDetection: number;    // Seconds until monitoring detected instance loss
  timeToReplacement: number;  // Seconds until new instance launched
  timeToHealthy: number;      // Seconds until new instance serving traffic
  totalRecoveryTime: number;  // End-to-end recovery time

  // Impact measurements
  errorRateBefore: number;    // Baseline error rate (15min before)
  errorRateDuring: number;    // Error rate during chaos window
  errorRateAfter: number;     // Error rate post-recovery
  latencyP50Before: number;
  latencyP50During: number;
  latencyP95Before: number;
  latencyP95During: number;

  // Capacity measurements
  instanceCountBefore: number;
  instanceCountMinimum: number;  // Lowest point during chaos
  instanceCountAfter: number;

  // Traffic measurements
  requestRateBefore: number;
  requestRateDuring: number;
  failedRequestsDuring: number;
}

class ChaosMetricsCollector {
  private prometheus: PrometheusClient;

  async collectMetrics(event: ChaosEvent): Promise<ChaosMetrics> {
    const windowStart = new Date(event.timestamp.getTime() - 15 * 60 * 1000);
    const windowEnd = new Date(event.timestamp.getTime() + 30 * 60 * 1000);

    // Query metrics around chaos event
    const [
      errorRates,
      latencies,
      instanceCounts,
      requestRates
    ] = await Promise.all([
      this.prometheus.query(`
        rate(http_requests_total{app="${event.application}", status=~"5.."}[1m])
        /
        rate(http_requests_total{app="${event.application}"}[1m])
      `, windowStart, windowEnd),

      this.prometheus.query(`
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket{app="${event.application}"}[1m])
        )
      `, windowStart, windowEnd),

      this.prometheus.query(`
        count(up{app="${event.application}"} == 1)
      `, windowStart, windowEnd),

      this.prometheus.query(`
        rate(http_requests_total{app="${event.application}"}[1m])
      `, windowStart, windowEnd)
    ]);

    // Calculate recovery timing
    const recoveryTimestamp = await this.findRecoveryPoint(event, instanceCounts);

    return {
      timeToDetection: this.calculateDetectionTime(event, instanceCounts),
      timeToReplacement: this.calculateReplacementTime(event, instanceCounts),
      timeToHealthy: this.calculateHealthyTime(event, instanceCounts),
      totalRecoveryTime: (recoveryTimestamp.getTime() - event.timestamp.getTime()) / 1000,

      errorRateBefore: this.average(errorRates, windowStart, event.timestamp),
      errorRateDuring: this.max(errorRates, event.timestamp, recoveryTimestamp),
      errorRateAfter: this.average(errorRates, recoveryTimestamp, windowEnd),

      latencyP50Before: this.percentile(latencies, 50, windowStart, event.timestamp),
      latencyP50During: this.percentile(latencies, 50, event.timestamp, recoveryTimestamp),
      latencyP95Before: this.percentile(latencies, 95, windowStart, event.timestamp),
      latencyP95During: this.percentile(latencies, 95, event.timestamp, recoveryTimestamp),

      instanceCountBefore: instanceCounts.at(event.timestamp),
      instanceCountMinimum: this.min(instanceCounts, event.timestamp, recoveryTimestamp),
      instanceCountAfter: instanceCounts.at(windowEnd),

      requestRateBefore: this.average(requestRates, windowStart, event.timestamp),
      requestRateDuring: this.average(requestRates, event.timestamp, recoveryTimestamp),
      failedRequestsDuring: this.calculateFailedRequests(errorRates, requestRates,
                                                         event.timestamp, recoveryTimestamp)
    };
  }

  // Calculate chaos health score: 0-100 (higher is better)
  calculateHealthScore(metrics: ChaosMetrics): number {
    let score = 100;

    // Penalize slow recovery (target: < 300 seconds)
    if (metrics.totalRecoveryTime > 300) {
      score -= Math.min(30, (metrics.totalRecoveryTime - 300) / 10);
    }

    // Penalize error rate increase
    const errorRateIncrease = metrics.errorRateDuring - metrics.errorRateBefore;
    if (errorRateIncrease > 0.001) { // > 0.1% increase
      score -= Math.min(40, errorRateIncrease * 4000);
    }

    // Penalize latency spike
    const latencyIncrease = metrics.latencyP95During / metrics.latencyP95Before;
    if (latencyIncrease > 1.2) { // > 20% increase
      score -= Math.min(20, (latencyIncrease - 1) * 20);
    }

    // Penalize failed requests
    if (metrics.failedRequestsDuring > 0) {
      score -= Math.min(10, metrics.failedRequestsDuring / 10);
    }

    return Math.max(0, Math.round(score));
  }
}
```

Annotate your observability dashboards with chaos events. When reviewing any incident, being able to see 'Chaos Monkey terminated instance-xyz at 10:23' provides essential context. Without this correlation, chaos-induced issues look like mysterious production problems.
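One lightweight way to get that correlation is to push an annotation to your dashboards every time a termination fires. The sketch below posts to Grafana's annotations HTTP API; the base URL, API token, and tags are placeholders, and your observability stack may expose a different but equivalent mechanism.

```typescript
// Sketch: annotate Grafana dashboards with each chaos event so that
// instance terminations are visible next to latency and error graphs.
// GRAFANA_URL and GRAFANA_TOKEN are placeholders.

interface ChaosAnnotation {
  application: string;
  instanceId: string;
  timestamp: Date;
}

async function annotateChaosEvent(event: ChaosAnnotation): Promise<void> {
  const response = await fetch(`${process.env.GRAFANA_URL}/api/annotations`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`,
    },
    body: JSON.stringify({
      time: event.timestamp.getTime(),          // epoch milliseconds
      tags: ["chaos-monkey", event.application],
      text: `Chaos Monkey terminated ${event.instanceId}`,
    }),
  });

  if (!response.ok) {
    throw new Error(`Failed to create annotation: ${response.status}`);
  }
}
```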
Chaos Monkey's greatest impact isn't technical—it's cultural. The tool forces a fundamental shift in how organizations think about reliability, ownership, and the relationship between development and operations.
The forcing function effect:
Chaos Monkey acts as a powerful forcing function for good engineering practices. When developers know their code will face random instance terminations:
- They keep services stateless and push state into external, durable stores
- They add timeouts, retries, and graceful degradation to every remote call
- They configure health checks and auto-scaling so replacements happen without human intervention
- They treat every individual instance as disposable rather than special
This proactive mindset is invaluable—and extremely difficult to achieve through policies alone. Chaos Monkey makes resilience tangible and immediate.
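To make that concrete, here is a minimal sketch of the kind of client-side defensiveness Chaos Monkey pushes teams toward: a short timeout, a retry against another instance, and a graceful fallback. The endpoint list and fallback value are illustrative, not drawn from any particular Netflix library.

```typescript
// Sketch: a call pattern that survives a mid-request instance termination.
// The endpoints and fallback constant are hypothetical.

const FALLBACK_RECOMMENDATIONS: string[] = []; // degrade gracefully: empty shelf, not an error page

async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
  return fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
}

async function getRecommendations(endpoints: string[]): Promise<string[]> {
  // Try each endpoint in turn; a terminated instance shows up as a timeout
  // or connection error, followed by a retry against another instance.
  for (const endpoint of endpoints) {
    try {
      const res = await fetchWithTimeout(`${endpoint}/recommendations`, 500);
      if (res.ok) return (await res.json()) as string[];
    } catch {
      // Timeout or connection refused: the instance may have just been killed.
      continue;
    }
  }
  // Every attempt failed: degrade rather than surface an error to the user.
  return FALLBACK_RECOMMENDATIONS;
}
```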
"The best way to avoid failure is to fail constantly." — Netflix Engineering
This paradox captures the essence of chaos engineering: by experiencing small, controlled failures constantly, you build systems that survive large, uncontrolled failures when they inevitably occur.
Measuring cultural shift:
Organizations adopting Chaos Monkey can measure cultural transformation through several indicators:
- Chaos-induced terminations stop generating pages or incident tickets
- Teams opt in voluntarily rather than seeking exemptions
- Findings from chaos events are treated as learning opportunities, not occasions for blame
- Resilience patterns such as timeouts, retries, and statelessness appear in designs before chaos exposes their absence
Chaos Monkey transformed how the technology industry thinks about reliability. Its influence extends far beyond Netflix, spawning an entire discipline—chaos engineering—and inspiring tools, practices, and cultural shifts at organizations worldwide.
What's next:
Chaos Monkey proved that intentional failure is a viable—even essential—reliability practice. This opened the door for more sophisticated tools designed for broader adoption. In the next page, we'll explore Gremlin, an enterprise chaos engineering platform that makes chaos accessible to organizations without Netflix's engineering resources.
You now understand Chaos Monkey's origins, architecture, operational mechanics, and lasting impact on the industry. This pioneering tool demonstrated that the only way to know if systems survive failure is to actually fail them—a lesson that continues to shape how we build resilient systems today.