The third principle of chaos engineering is also its most controversial: Run experiments in production.
This principle generates immediate pushback. It sounds reckless. It seems irresponsible. Breaking production on purpose? At companies where downtime costs millions per hour? Where engineers get paged at 3 AM for any anomaly?
Yet this principle sits at the heart of chaos engineering's value proposition. Testing in staging environments—no matter how sophisticated—cannot fully validate production resilience. The differences between staging and production are exactly the kinds of conditions that cause real incidents: data scale, traffic patterns, configuration drift, third-party behavior.
The key insight is that you're already running experiments in production—you just call them "deployments" and "incidents." Chaos engineering formalizes and controls this experimentation, making it deliberate rather than accidental.
By the end of this page, you will understand why production is the only environment that matters, how to safely introduce chaos in live systems, the prerequisites that make production experiments responsible, progressive rollout strategies for chaos, and how to build organizational confidence in this practice.
The argument for production chaos rests on a fundamental observation: staging environments lie.
They lie in ways that are subtle, numerous, and unpredictable, and those lies create false confidence: the system passes every test under artificial conditions, then goes to production, encounters real conditions, and fails—despite "passing all tests."
| Dimension | Staging | Production | Failure Risk |
|---|---|---|---|
| Data Volume | Thousands of records | Billions of records | Query plans change, indexes behave differently |
| Traffic Shape | Uniform synthetic load | Spiky, geographic, user-behavior-driven | Hot spots, thundering herds, cache stampedes |
| Config State | Recently reset, known state | Accumulated changes, drift from intended state | Unexpected interactions, missing config |
| Third Parties | Sandbox/mock APIs | Real APIs with rate limits, varying behavior | Rate limiting, real error modes, latency variance |
| Infrastructure | Simplified, often single-AZ | Multi-AZ/region, complex network topology | Network partition behavior, cross-AZ latency |
| Dependencies | Stub services, controlled responses | Real services with their own issues | Dependency failures, version mismatches |
The confidence paradox:
Teams that only test in staging believe they have higher confidence in their system than they actually do. They've validated that the system works under specific, artificial conditions—but those aren't the conditions that cause real incidents.
Teams that run chaos in production develop calibrated confidence. They know exactly which failure modes their system can handle because they've tested them under real conditions. They also know which failure modes they haven't tested yet—rather than assuming everything is fine.
The cost calculation:
The risk of production chaos experiments is real, but it must be weighed against the alternative risk: discovering the same weaknesses through actual incidents.
Which would you prefer: a controlled experiment with a small blast radius, an abort switch, and the team watching, or an uncontrolled outage at an unpredictable hour?
Netflix, the pioneer of chaos engineering, phrases it this way: "If a failure is going to happen, we'd rather it happen on a Monday morning when we're ready, than a Saturday night when we're not."
Every deployment is an experiment in production. Every configuration change is an experiment in production. Every scaling event is an experiment in production. You're already doing production experimentation—you're just not controlling it. Chaos engineering makes the implicit explicit and the accidental deliberate.
Running chaos in production isn't reckless—when done correctly. The key is establishing prerequisites that make experiments safe. Without these foundations, production chaos is indeed irresponsible.
The maturity ladder:
Organizations ready for production chaos have climbed a maturity ladder first. Each rung is a prerequisite for responsible experimentation:
```python
from dataclasses import dataclass
from typing import List, Optional, Dict
from enum import Enum


class ReadinessLevel(Enum):
    NOT_READY = "not_ready"
    STAGING_ONLY = "staging_only"
    PRODUCTION_LIMITED = "production_limited"
    PRODUCTION_FULL = "production_full"


# Explicit ordering for readiness levels (their string values do not sort meaningfully)
LEVEL_ORDER = {
    ReadinessLevel.NOT_READY: 0,
    ReadinessLevel.STAGING_ONLY: 1,
    ReadinessLevel.PRODUCTION_LIMITED: 2,
    ReadinessLevel.PRODUCTION_FULL: 3,
}


@dataclass
class PrerequisiteCheck:
    name: str
    description: str
    required_for: ReadinessLevel
    status: bool
    details: Optional[str] = None


class ProductionChaosReadinessAssessment:
    """
    Assesses organizational and technical readiness for production chaos.
    Running chaos without these prerequisites in place is irresponsible.
    """

    def __init__(self):
        self.checks: List[PrerequisiteCheck] = []

    def run_assessment(self) -> Dict:
        """Run all prerequisite checks and determine readiness level."""
        # Observability checks
        self.checks.append(PrerequisiteCheck(
            name="Real-time Metrics Dashboard",
            description="Key business and technical metrics visible in real-time",
            required_for=ReadinessLevel.STAGING_ONLY,
            status=self._check_metrics_dashboard(),
            details="Must have <30s data freshness for key SLIs"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Alerting System",
            description="Automated alerts for metric threshold breaches",
            required_for=ReadinessLevel.STAGING_ONLY,
            status=self._check_alerting(),
            details="Alert latency must be <2 minutes"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Distributed Tracing",
            description="Request flow visibility across services",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_tracing(),
            details="Trace coverage >80% of requests"
        ))

        # Incident response checks
        self.checks.append(PrerequisiteCheck(
            name="On-Call Rotation",
            description="Staff available to respond to issues 24/7",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_oncall(),
            details="Must have primary and backup on-call"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Incident Runbooks",
            description="Documented procedures for common failure modes",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_runbooks(),
            details="Runbooks tested within last 90 days"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Incident Communication Channel",
            description="Established channel for incident coordination",
            required_for=ReadinessLevel.STAGING_ONLY,
            status=self._check_comm_channel(),
            details="Slack channel, bridge line, or similar"
        ))

        # Control mechanism checks
        self.checks.append(PrerequisiteCheck(
            name="Rollback Automation",
            description="One-click or automated deployment rollback",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_rollback(),
            details="Rollback must complete in <10 minutes"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Feature Flags",
            description="Ability to toggle features without deployment",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_feature_flags(),
            details="Flags must propagate in <60 seconds"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Traffic Splitting",
            description="Ability to route subset of traffic for experiments",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_traffic_splitting(),
            details="Canary/A-B testing infrastructure"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Automated Abort",
            description="Auto-terminate experiments on metric breach",
            required_for=ReadinessLevel.PRODUCTION_FULL,
            status=self._check_auto_abort(),
            details="Linked to observability system"
        ))

        # Organizational checks
        self.checks.append(PrerequisiteCheck(
            name="Stakeholder Approval",
            description="Leadership approval for production experiments",
            required_for=ReadinessLevel.PRODUCTION_LIMITED,
            status=self._check_stakeholder_approval(),
            details="Written policy with approved experiment types"
        ))
        self.checks.append(PrerequisiteCheck(
            name="Chaos Experiment Review Process",
            description="Peer review for experiment design",
            required_for=ReadinessLevel.PRODUCTION_FULL,
            status=self._check_review_process(),
            details="Similar to code review for experiments"
        ))

        return self._calculate_readiness()

    def _calculate_readiness(self) -> Dict:
        """Determine overall readiness level based on checks."""
        failed_checks = [c for c in self.checks if not c.status]

        # Find the highest level where all requirements are met
        for level in reversed(list(ReadinessLevel)):
            required_for_level = [
                c for c in self.checks
                if LEVEL_ORDER[c.required_for] <= LEVEL_ORDER[level]
            ]
            if all(c.status for c in required_for_level):
                return {
                    "readiness_level": level.value,
                    "passed_checks": len(self.checks) - len(failed_checks),
                    "total_checks": len(self.checks),
                    "blocking_items": [
                        {"name": c.name, "details": c.details}
                        for c in failed_checks
                        if LEVEL_ORDER[c.required_for] <= LEVEL_ORDER[level]
                    ],
                    "recommendations": self._generate_recommendations(level)
                }

        return {
            "readiness_level": ReadinessLevel.NOT_READY.value,
            "blocking_items": [{"name": c.name, "details": c.details}
                               for c in failed_checks],
            "recommendations": ["Address foundational observability and incident response first"]
        }

    # Stub implementations - replace with actual checks
    def _check_metrics_dashboard(self) -> bool: return True
    def _check_alerting(self) -> bool: return True
    def _check_tracing(self) -> bool: return True
    def _check_oncall(self) -> bool: return True
    def _check_runbooks(self) -> bool: return False  # Example: not ready
    def _check_comm_channel(self) -> bool: return True
    def _check_rollback(self) -> bool: return True
    def _check_feature_flags(self) -> bool: return True
    def _check_traffic_splitting(self) -> bool: return True
    def _check_auto_abort(self) -> bool: return False  # Example: not ready
    def _check_stakeholder_approval(self) -> bool: return True
    def _check_review_process(self) -> bool: return False  # Example: not ready

    def _generate_recommendations(self, current_level: ReadinessLevel) -> List[str]:
        """Generate recommendations for reaching next maturity level."""
        recommendations = []
        if current_level == ReadinessLevel.STAGING_ONLY:
            recommendations.append("Implement traffic splitting for controlled production experiments")
            recommendations.append("Establish formal on-call rotation with escalation paths")
        elif current_level == ReadinessLevel.PRODUCTION_LIMITED:
            recommendations.append("Build automated abort mechanisms")
            recommendations.append("Formalize experiment review process")
        return recommendations
```

Teams sometimes want to jump to production chaos before the foundations are in place. This always ends badly. An experiment without observability generates no learning. An experiment without abort capability can become an incident. Build the ladder before you climb it.
With prerequisites in place, let's examine the mechanics of safe production experiments. Every experiment follows a lifecycle designed to maximize learning while minimizing risk.
The experiment lifecycle:
Every production experiment moves through the same phases: preflight checks, baseline capture, fault injection, observation with abort monitoring, termination, and recovery measurement.
Pre-flight checks in detail:
Before any production experiment, verify that the system is healthy, no incident is in progress, no recent deployment is still settling, the responsible team is available, the business window is acceptable, and external dependencies are behaving normally. The Go sketch below models the full lifecycle, including these preflight checks:
```go
package chaos

import (
	"context"
	"fmt"
	"time"
)

// ExperimentState represents the current phase of an experiment
type ExperimentState string

const (
	StatePending    ExperimentState = "pending"
	StatePreflight  ExperimentState = "preflight"
	StateBaseline   ExperimentState = "baseline"
	StateInjecting  ExperimentState = "injecting"
	StateObserving  ExperimentState = "observing"
	StateRecovering ExperimentState = "recovering"
	StateComplete   ExperimentState = "complete"
	StateAborted    ExperimentState = "aborted"
)

// AbortCondition defines when an experiment should automatically terminate
type AbortCondition struct {
	Metric    string // e.g., "error_rate", "latency_p99"
	Operator  string // "gt", "lt", "eq"
	Threshold float64
	Duration  time.Duration // Condition must persist for this duration
}

// Experiment represents a chaos experiment with full lifecycle management
type Experiment struct {
	ID         string
	Name       string
	Hypothesis string
	State      ExperimentState

	// Configuration
	Duration        time.Duration
	BlastRadius     float64 // Percentage of traffic/instances affected
	AbortConditions []AbortCondition

	// Timing
	StartedAt *time.Time
	EndedAt   *time.Time

	// Dependencies (MetricsClient, FailureInjector, NotificationService,
	// BaselineMetrics, MetricsSnapshot, and ExperimentResult are assumed to
	// be defined elsewhere in this package)
	metrics  MetricsClient
	injector FailureInjector
	notifier NotificationService
}

// PreflightChecks verifies all conditions for safe experimentation
func (e *Experiment) PreflightChecks(ctx context.Context) error {
	e.State = StatePreflight

	checks := []struct {
		name string
		fn   func() error
	}{
		{"system_health", e.checkSystemHealth},
		{"ongoing_incidents", e.checkNoOngoingIncidents},
		{"recent_deployments", e.checkNoRecentDeployments},
		{"team_availability", e.checkTeamAvailable},
		{"business_window", e.checkBusinessWindow},
		{"external_deps", e.checkExternalDependencies},
	}

	for _, check := range checks {
		if err := check.fn(); err != nil {
			e.notifier.Alert(fmt.Sprintf(
				"Preflight check '%s' failed: %v", check.name, err,
			))
			return fmt.Errorf("preflight failed: %s: %w", check.name, err)
		}
	}
	return nil
}

// CaptureBaseline records steady state metrics before injection
func (e *Experiment) CaptureBaseline(ctx context.Context, duration time.Duration) (*BaselineMetrics, error) {
	e.State = StateBaseline

	// Collect metrics over the baseline window
	metrics, err := e.metrics.CollectWindow(ctx, duration)
	if err != nil {
		return nil, fmt.Errorf("baseline collection failed: %w", err)
	}

	baseline := &BaselineMetrics{
		StartTime:   time.Now().Add(-duration),
		EndTime:     time.Now(),
		LatencyP50:  metrics.Percentile("latency", 50),
		LatencyP95:  metrics.Percentile("latency", 95),
		LatencyP99:  metrics.Percentile("latency", 99),
		ErrorRate:   metrics.Rate("errors"),
		Throughput:  metrics.Rate("requests"),
		SuccessRate: metrics.Rate("success"),
	}
	return baseline, nil
}

// Execute runs the full experiment lifecycle
func (e *Experiment) Execute(ctx context.Context) (*ExperimentResult, error) {
	// Phase 1: Preflight
	if err := e.PreflightChecks(ctx); err != nil {
		return nil, err
	}

	// Phase 2: Baseline
	baseline, err := e.CaptureBaseline(ctx, 5*time.Minute)
	if err != nil {
		return nil, err
	}

	// Phase 3: Inject
	e.State = StateInjecting
	now := time.Now()
	e.StartedAt = &now
	if err := e.injector.Start(ctx, e.BlastRadius); err != nil {
		return nil, fmt.Errorf("injection failed: %w", err)
	}
	e.notifier.Announce(fmt.Sprintf(
		"🔬 Chaos experiment '%s' started. Blast radius: %.1f%%",
		e.Name, e.BlastRadius*100,
	))

	// Phase 4: Observe with abort monitoring
	e.State = StateObserving
	aborted := e.observeWithAbortMonitoring(ctx)

	// Phase 5: Terminate
	e.injector.Stop(ctx)
	endTime := time.Now()
	e.EndedAt = &endTime

	if aborted {
		e.State = StateAborted
	} else {
		e.State = StateRecovering
	}

	// Phase 6: Recovery metrics
	recoveryMetrics, _ := e.metrics.CollectWindow(ctx, 5*time.Minute)
	e.State = StateComplete

	return &ExperimentResult{
		ExperimentID:    e.ID,
		Hypothesis:      e.Hypothesis,
		Baseline:        baseline,
		DuringChaos:     e.captureExperimentMetrics(),
		Recovery:        recoveryMetrics,
		Aborted:         aborted,
		Duration:        e.EndedAt.Sub(*e.StartedAt),
		HypothesisValid: e.evaluateHypothesis(baseline),
	}, nil
}

// observeWithAbortMonitoring watches for abort conditions during experiment
func (e *Experiment) observeWithAbortMonitoring(ctx context.Context) bool {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	timeout := time.After(e.Duration)

	for {
		select {
		case <-ticker.C:
			for _, condition := range e.AbortConditions {
				if e.checkAbortCondition(condition) {
					e.notifier.Alert(fmt.Sprintf(
						"🚨 Abort condition triggered: %s %s %.2f",
						condition.Metric, condition.Operator, condition.Threshold,
					))
					return true
				}
			}
		case <-timeout:
			return false
		case <-ctx.Done():
			return true
		}
	}
}

// Stub implementations
func (e *Experiment) checkSystemHealth() error                          { return nil }
func (e *Experiment) checkNoOngoingIncidents() error                    { return nil }
func (e *Experiment) checkNoRecentDeployments() error                   { return nil }
func (e *Experiment) checkTeamAvailable() error                         { return nil }
func (e *Experiment) checkBusinessWindow() error                        { return nil }
func (e *Experiment) checkExternalDependencies() error                  { return nil }
func (e *Experiment) checkAbortCondition(c AbortCondition) bool         { return false }
func (e *Experiment) captureExperimentMetrics() *MetricsSnapshot        { return nil }
func (e *Experiment) evaluateHypothesis(baseline *BaselineMetrics) bool { return true }
```

Blast radius is the scope of impact an experiment can have. Controlling blast radius is the primary mechanism for making production experiments safe. A well-designed chaos program progressively increases blast radius as confidence grows.
Dimensions of blast radius:
| Level | Traffic | Users | Duration | Components |
|---|---|---|---|---|
| 1 - Toe in water | 0.1% | Internal only | 1 min | 1 instance |
| 2 - Cautious | 1% | Beta users | 5 min | 1 instance |
| 3 - Low confidence | 5% | Random sampling | 10 min | 25% of instances |
| 4 - Medium confidence | 10% | Excluding enterprise | 15 min | 50% of instances |
| 5 - High confidence | 50% | All users | 30 min | Multiple services |
| 6 - Full coverage | 100% | All users | Continuous | System-wide |
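To make the progression concrete, here is a minimal Python sketch of a blast-radius schedule built from the table above. The promotion rule (advance one level only after three consecutive clean runs) is an illustrative assumption, not a prescribed standard.

```python
# Minimal sketch: a progressive blast-radius schedule mirroring the table above.
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadiusLevel:
    name: str
    traffic_pct: float    # share of traffic routed through the chaos path
    audience: str         # which users are eligible
    duration_min: int     # how long the injection runs (0 = continuous)
    component_scope: str  # how much of the fleet is affected


LEVELS = [
    BlastRadiusLevel("toe_in_water", 0.1, "internal only", 1, "1 instance"),
    BlastRadiusLevel("cautious", 1.0, "beta users", 5, "1 instance"),
    BlastRadiusLevel("low_confidence", 5.0, "random sampling", 10, "25% of instances"),
    BlastRadiusLevel("medium_confidence", 10.0, "excluding enterprise", 15, "50% of instances"),
    BlastRadiusLevel("high_confidence", 50.0, "all users", 30, "multiple services"),
    BlastRadiusLevel("full_coverage", 100.0, "all users", 0, "system-wide"),
]

CLEAN_RUNS_TO_PROMOTE = 3  # assumption: promote after three consecutive clean runs


def next_level(current_index: int, consecutive_clean_runs: int) -> int:
    """Return the index of the level the next experiment should use."""
    if consecutive_clean_runs >= CLEAN_RUNS_TO_PROMOTE:
        return min(current_index + 1, len(LEVELS) - 1)
    return current_index  # stay put until confidence is earned


if __name__ == "__main__":
    idx = 0
    for clean_runs in (1, 3, 3, 2):  # hypothetical run history
        idx = next_level(idx, clean_runs)
        print(f"next experiment runs at level: {LEVELS[idx].name}")
```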
Implementation techniques:
Traffic splitting is the most common approach. Modern load balancers and service meshes support percentage-based routing that directs a fraction of requests through the chaos path:
```yaml
# Istio VirtualService for chaos routing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service-chaos   # illustrative name, not part of the original snippet
spec:
  hosts:
  - order-service
  http:
  - match:
    - headers:
        x-chaos-experiment:
          exact: "order-latency-v1"
    route:
    - destination:
        host: order-service
        subset: chaos
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 3s
  - route:
    - destination:
        host: order-service
        subset: stable
      weight: 100
```
User-based targeting uses feature flags or user attributes to select experiment participants:
```javascript
if (featureFlags.isEnabled('chaos-slow-checkout', {
  userId: user.id,
  tier: user.tier,
  percentage: 5 // 5% of eligible users
})) {
  await injectDelay(3000);
}
```
Your first production experiments should have a blast radius so small that even complete failure would go unnoticed: 0.1% of traffic for 60 seconds. This builds confidence with near-zero risk. You can always expand scope later; you can't unexplode a bomb.
Even with small blast radius, experiments can go wrong. Automated abort mechanisms—sometimes called "guardrails" or "safety switches"—provide the safety net that makes production chaos responsible.
Types of abort conditions: metric breaches (an error rate or latency threshold exceeded for a sustained period), experiment timeouts, manual aborts (the big red button), external triggers from incident management, and a dead man's switch that fires when monitoring itself goes silent. The TypeScript sketch below implements each of these:
```typescript
import { EventEmitter } from 'events';

interface AbortCondition {
  metric: string;
  operator: '<' | '>' | '=' | '!=' | '<=' | '>=';
  threshold: number;
  sustainedDurationMs: number; // Condition must persist this long
  description: string;
}

interface MetricReading {
  name: string;
  value: number;
  timestamp: Date;
}

interface AbortEvent {
  reason: 'metric_breach' | 'timeout' | 'manual' | 'external' | 'dead_man_switch';
  condition?: AbortCondition;
  details: string;
  timestamp: Date;
}

class ChaosAbortController extends EventEmitter {
  private conditions: AbortCondition[] = [];
  private breachStartTimes: Map<string, Date> = new Map();
  private isActive: boolean = false;
  private heartbeatTimeout: NodeJS.Timeout | null = null;
  private readonly HEARTBEAT_INTERVAL_MS = 10000; // 10 seconds

  constructor() {
    super();
  }

  /**
   * Add an abort condition. The experiment will abort if this
   * condition is breached for the specified duration.
   */
  addCondition(condition: AbortCondition): void {
    this.conditions.push(condition);
  }

  /**
   * Start monitoring. Must call heartbeat() periodically
   * or dead man's switch will trigger.
   */
  start(): void {
    this.isActive = true;
    this.resetHeartbeat();
    console.log(`Abort controller active with ${this.conditions.length} conditions`);
  }

  /**
   * Stop monitoring and reset state.
   */
  stop(): void {
    this.isActive = false;
    if (this.heartbeatTimeout) {
      clearTimeout(this.heartbeatTimeout);
    }
    this.breachStartTimes.clear();
  }

  /**
   * Call periodically to prevent dead man's switch abort.
   * This ensures the chaos experiment is still being monitored.
   */
  heartbeat(): void {
    if (!this.isActive) return;
    this.resetHeartbeat();
  }

  /**
   * Process a new metric reading, check against abort conditions.
   */
  evaluateMetric(reading: MetricReading): void {
    if (!this.isActive) return;

    for (const condition of this.conditions) {
      if (condition.metric !== reading.name) continue;

      const isBreached = this.checkBreach(reading.value, condition);
      const conditionKey = this.conditionKey(condition);

      if (isBreached) {
        // Condition is breached - check if it's sustained
        if (!this.breachStartTimes.has(conditionKey)) {
          // First breach - record start time
          this.breachStartTimes.set(conditionKey, reading.timestamp);
        } else {
          // Ongoing breach - check duration
          const breachStart = this.breachStartTimes.get(conditionKey)!;
          const breachDurationMs = reading.timestamp.getTime() - breachStart.getTime();

          if (breachDurationMs >= condition.sustainedDurationMs) {
            // Abort condition met!
            this.triggerAbort({
              reason: 'metric_breach',
              condition,
              details: `${condition.metric} ${condition.operator} ${condition.threshold} for ${breachDurationMs}ms: current value ${reading.value}`,
              timestamp: new Date()
            });
          }
        }
      } else {
        // Condition not breached - reset timer
        this.breachStartTimes.delete(conditionKey);
      }
    }
  }

  /**
   * Manual abort - big red button.
   */
  manualAbort(reason: string): void {
    this.triggerAbort({
      reason: 'manual',
      details: reason,
      timestamp: new Date()
    });
  }

  /**
   * External trigger - e.g., from incident management system
   */
  externalAbort(source: string, details: string): void {
    this.triggerAbort({
      reason: 'external',
      details: `Source: ${source}, Details: ${details}`,
      timestamp: new Date()
    });
  }

  private resetHeartbeat(): void {
    if (this.heartbeatTimeout) {
      clearTimeout(this.heartbeatTimeout);
    }
    this.heartbeatTimeout = setTimeout(() => {
      // Dead man's switch - no heartbeat received
      this.triggerAbort({
        reason: 'dead_man_switch',
        details: `No heartbeat received for ${this.HEARTBEAT_INTERVAL_MS * 2}ms`,
        timestamp: new Date()
      });
    }, this.HEARTBEAT_INTERVAL_MS * 2);
  }

  private checkBreach(value: number, condition: AbortCondition): boolean {
    switch (condition.operator) {
      case '>': return value > condition.threshold;
      case '<': return value < condition.threshold;
      case '>=': return value >= condition.threshold;
      case '<=': return value <= condition.threshold;
      case '=': return value === condition.threshold;
      case '!=': return value !== condition.threshold;
      default: return false;
    }
  }

  private conditionKey(condition: AbortCondition): string {
    return `${condition.metric}-${condition.operator}-${condition.threshold}`;
  }

  private triggerAbort(event: AbortEvent): void {
    console.error('🚨 ABORT TRIGGERED:', event);
    this.isActive = false;
    this.emit('abort', event);
  }
}

// Usage example (chaosInjector, slack, and remediation are assumed to be
// provided by the surrounding chaos tooling)
const abortController = new ChaosAbortController();

abortController.addCondition({
  metric: 'error_rate',
  operator: '>',
  threshold: 0.05,            // 5% error rate
  sustainedDurationMs: 30000, // for 30 seconds
  description: 'Error rate exceeds 5% for 30 seconds'
});

abortController.addCondition({
  metric: 'latency_p99',
  operator: '>',
  threshold: 5000,            // 5 seconds
  sustainedDurationMs: 10000, // for 10 seconds
  description: 'P99 latency exceeds 5s for 10 seconds'
});

abortController.on('abort', (event: AbortEvent) => {
  // Stop chaos injection immediately
  chaosInjector.stop();

  // Notify team
  slack.alert('#chaos-engineering',
    `Experiment aborted: ${event.details}`
  );

  // Trigger any remediation
  remediation.execute();
});

export { ChaosAbortController, AbortCondition, AbortEvent };
```

The abort mechanism is itself critical infrastructure. Test that it works before relying on it. Run experiments specifically designed to trigger aborts. Verify alerts fire. Confirm the injection actually stops. A broken abort switch provides false confidence.
When you run experiments matters as much as what you run. Strategic timing reduces risk while maximizing learning.
Optimal experiment windows:
Run experiments when the responsible team is at full strength and ready to respond, inside approved business windows, and never while an incident is open or a recent deployment is still settling. A simple window check, as in the sketch below, can gate the preflight phase.
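As a small illustration, the following Python sketch gates experiments on an approved window. The specific boundaries (Tuesday through Thursday, 10:00 to 16:00 local time) are assumptions for the example; use whatever windows your organization approves.

```python
# Minimal sketch of a business-window check, mirroring the checkBusinessWindow
# idea in the lifecycle code above. Window boundaries are illustrative assumptions.
from datetime import datetime

APPROVED_WEEKDAYS = {1, 2, 3}  # Tuesday, Wednesday, Thursday (Monday == 0)
WINDOW_START_HOUR = 10
WINDOW_END_HOUR = 16


def in_approved_window(now: datetime) -> bool:
    """Return True if 'now' falls inside an approved experiment window."""
    return (
        now.weekday() in APPROVED_WEEKDAYS
        and WINDOW_START_HOUR <= now.hour < WINDOW_END_HOUR
    )


if __name__ == "__main__":
    if in_approved_window(datetime.now()):
        print("Window open: experiment may proceed to preflight checks.")
    else:
        print("Outside approved window: defer the experiment.")
```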
The progression to continuous chaos:
Mature chaos programs eventually run experiments continuously. This represents the ultimate confidence in system resilience:
Stage 1: Scheduled, announced — "We're running a chaos experiment Tuesday at 2 PM." The team is ready, stakeholders are informed, and preparation is maximal.
Stage 2: Scheduled, unannounced — experiments run at predetermined times without advance notice. This tests whether monitoring catches issues without prior warning.
Stage 3: Random window — experiments run at random times within approved windows. This tests response at various traffic levels and team availability states.
Stage 4: Continuous — chaos experiments run constantly in production. Netflix's Chaos Monkey operates this way: every minute, something might fail.
Most organizations are appropriately served by Stage 2 or 3. Stage 4 requires exceptional maturity.
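One way to encode these stages is sketched below in Python. The stage semantics mirror the list above; the scheduling details, such as drawing a uniformly random start time inside an approved window for Stage 3, are illustrative assumptions.

```python
# Minimal sketch: encoding the four announcement stages and choosing a start time.
import random
from datetime import datetime, timedelta
from enum import Enum


class ChaosStage(Enum):
    SCHEDULED_ANNOUNCED = 1    # fixed time, advance notice
    SCHEDULED_UNANNOUNCED = 2  # fixed time, no advance notice
    RANDOM_WINDOW = 3          # random time inside an approved window
    CONTINUOUS = 4             # always on


def pick_start_time(stage: ChaosStage, window_start: datetime, window_end: datetime) -> datetime:
    """Choose an experiment start time according to the program's stage."""
    if stage in (ChaosStage.SCHEDULED_ANNOUNCED, ChaosStage.SCHEDULED_UNANNOUNCED):
        return window_start  # the predetermined slot
    if stage is ChaosStage.RANDOM_WINDOW:
        offset = random.uniform(0, (window_end - window_start).total_seconds())
        return window_start + timedelta(seconds=offset)
    return datetime.now()  # CONTINUOUS: effectively "now, and again soon"


if __name__ == "__main__":
    start = datetime(2024, 3, 19, 10, 0)
    end = datetime(2024, 3, 19, 16, 0)
    print(pick_start_time(ChaosStage.RANDOM_WINDOW, start, end))
```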
Netflix coined the principle: if a failure would be serious on Saturday night, it's worth testing on Monday morning. The goal is to surface problems when engineers are at their desks and ready, not when everyone is asleep. Schedule experiments for when you're prepared to learn from them.
Chaos experiments, especially in production, require clear communication. Different stakeholders need different information at different times.
Communication layers:
| Stakeholder | When | What to Communicate | Channel |
|---|---|---|---|
| Executive Leadership | Monthly/Quarterly | Program progress, risk reduction, major findings | Report, dashboard |
| Product Management | Before campaigns, launches | Resilience status, any user-impacting restrictions | Planning meetings |
| Engineering Teams | Weekly, before experiments | Upcoming experiments, past results, action items | Team sync, Slack |
| On-Call Engineers | Real-time during experiments | Active experiment details, how to abort, expected impact | Slack bot, PagerDuty annotation |
| Customer Support | Before customer-visible experiments | Potential symptoms, talking points, duration | Support briefing |
Real-time experiment announcements:
During active experiments, machine-generated communications keep stakeholders informed without requiring manual updates:
```text
🔬 CHAOS EXPERIMENT ACTIVE
━━━━━━━━━━━━━━━━━━━━━━━━
Name: API Latency Injection v3
Started: 2024-03-15 14:32 UTC
Duration: 15 minutes
Blast Radius: 5% of US-East traffic
Expected Impact:
- P95 latency may increase by ~500ms
- No expected error rate increase
Abort Command: /chaos abort api-latency-v3
Contact: @chaos-engineering
Dashboard: [link]
```
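A small Python sketch of how such an announcement could be generated from experiment metadata is shown below. The `ExperimentAnnouncement` class and its field names are assumptions chosen to match the example message; delivery (a Slack bot, a PagerDuty annotation, and so on) is left to your own tooling.

```python
# Minimal sketch: render the real-time announcement from experiment metadata.
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentAnnouncement:
    name: str
    started_utc: str
    duration: str
    blast_radius: str
    expected_impact: List[str]
    abort_command: str
    contact: str
    dashboard_url: str

    def render(self) -> str:
        impact_lines = "\n".join(f"- {line}" for line in self.expected_impact)
        return (
            "🔬 CHAOS EXPERIMENT ACTIVE\n"
            f"Name: {self.name}\n"
            f"Started: {self.started_utc}\n"
            f"Duration: {self.duration}\n"
            f"Blast Radius: {self.blast_radius}\n"
            f"Expected Impact:\n{impact_lines}\n"
            f"Abort Command: {self.abort_command}\n"
            f"Contact: {self.contact}\n"
            f"Dashboard: {self.dashboard_url}"
        )


if __name__ == "__main__":
    msg = ExperimentAnnouncement(
        name="API Latency Injection v3",
        started_utc="2024-03-15 14:32 UTC",
        duration="15 minutes",
        blast_radius="5% of US-East traffic",
        expected_impact=["P95 latency may increase by ~500ms",
                         "No expected error rate increase"],
        abort_command="/chaos abort api-latency-v3",
        contact="@chaos-engineering",
        dashboard_url="[link]",
    ).render()
    print(msg)  # post this to your incident channel via whatever bot you use
```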
Post-experiment reporting:
Every experiment should produce a brief report, even if the hypothesis was confirmed:
```markdown
# Chaos Experiment Report

## Summary
| Field | Value |
|-------|-------|
| Experiment ID | CHX-2024-0147 |
| Date | 2024-03-15 |
| Duration | 15 minutes |
| Blast Radius | 5% US-East traffic |
| Outcome | Hypothesis Confirmed ✅ |

## Hypothesis
> If we inject 500ms latency into the order service API, checkout
> success rate will remain above 95% because client-side timeouts
> are set to 3s and retry logic will handle transient failures.

## Results

### Key Metrics During Experiment
| Metric | Baseline | During Experiment | Impact |
|--------|----------|-------------------|--------|
| Checkout Success Rate | 99.2% | 98.7% | -0.5% |
| Order API P95 Latency | 180ms | 720ms | +540ms |
| Retry Rate | 0.3% | 2.1% | +1.8% |

### Observations
1. Retry logic activated as expected, absorbing most latency impact
2. Client timeout of 3s provided sufficient buffer for 500ms injection
3. No cascade effects observed in upstream services

## Conclusion
Hypothesis confirmed. The order service API can tolerate 500ms latency injection with minimal customer impact due to proper timeout and retry configuration.

## Follow-up Actions
- [ ] None required - experiment validated expected behavior
- [ ] Consider testing with 1000ms latency for next iteration

## Attachments
- [Grafana Dashboard Snapshot](link)
- [Full Metrics Export](link)
- [Trace Samples](link)
```

The third principle of chaos engineering—running experiments in production—distinguishes chaos engineering from traditional testing. Let's consolidate the key insights:
- Staging environments cannot reproduce production's data scale, traffic shape, configuration drift, or third-party behavior; only production validates production resilience.
- You are already experimenting in production through deployments, configuration changes, and incidents; chaos engineering makes that experimentation deliberate and controlled.
- Production chaos is responsible only once the prerequisites are in place: observability, incident response, rollback, traffic splitting, and automated aborts.
- Blast radius control, abort conditions, strategic timing, and clear stakeholder communication are the mechanisms that keep experiments safe.
What's next:
With the foundations of production experimentation established, we'll explore the fourth principle: Automate Experiments to Run Continuously. This principle transforms chaos from an occasional practice into a permanent part of your system's resilience infrastructure.
You now understand why production experiments are essential, the prerequisites that make them responsible, how to control blast radius, implement abort mechanisms, time experiments strategically, and communicate with stakeholders. You're ready to run chaos in production—safely.